Abstract



We study online learnability of a wide class of problems, extending the results of [25] to general
notions of performance measure well beyond external regret. Our framework simultaneously captures such
well-known notions as internal and general $\Phi$-regret, learning with non-additive global cost functions,
Blackwell's approachability, calibration of forecasters, adaptive regret, and more. We show that
learnability in all these situations is due to control of the same three quantities: a martingale convergence
term, a term describing the ability to perform well if the future is known, and a generalization of sequential
Rademacher complexity, studied in [25]. Since we directly study complexity of the problem instead of
focusing on efficient algorithms, we are able to improve and extend many known results which have been
previously derived via an algorithmic construction.

1 Introduction 

In the companion paper [25] we analyzed learnability in the Online Learning Model when the value of the
game is defined through minimax regret. However, regret (also known as external regret) is not the only way
to measure performance of an online learning procedure. In the present paper, we extend the results of [25]
to other performance measures, encompassing a wide spectrum of notions which appear in the literature.
Our framework gives the same footing to external regret, internal and general $\Phi$-regret, learning with
non-additive global cost functions, Blackwell's approachability, calibration of forecasters, adaptive regret, and
more. We recover, extend, and improve some existing results, and (what is more important) show that they
all follow from control of the same quantities. In particular, sequential Rademacher complexity, introduced
in [25], plays a key role in these derivations.

A reflection on the past two decades of research in learning theory reveals (in our somewhat biased view) 
an interesting difference between Statistical Learning Theory and Online Learning. In the former, the focus 
has been primarily on understanding complexity measures rather than algorithms. There are good reasons 
for this: if a supervised problem with i.i.d. data is learnable, Empirical Risk Minimization is the algorithm 
that will perform well if one disregards computational aspects. In contrast, Online Learning has been mainly 
centered around algorithms. Given an algorithm, a non-trivial bound serves as a certificate that the problem 
is learnable. This algorithm-focused approach has dominated research in Online Learning for several decades. 
Many important tools (such as optimization-based algorithms for online convex optimization) have emerged, 
yet the results lacked a unified approach for determining learnability. 

With the tools developed in [25], the question of learnability can now be addressed in a variety of situations
in a unified manner. In fact, [25] presents a number of examples of provably learnable problems for which
computationally feasible online learning methods have not yet been developed. In the present paper, we
show that the scope of problems whose learnability and precise rates can be characterized is much larger
than those defined in [25] through external regret. Within this circle of problems are such well-known results
as Blackwell's approachability and calibration of forecasters. For instance, our complexity-based (rather
than algorithm-based) approach yields a proof of Blackwell's approachability in Banach spaces without
ever mentioning an algorithm. Let us remark that Blackwell's approachability has been a key tool for
showing learnability. Since our results imply approachability, they can be utilized whenever Blackwell's
approachability has been successful. The results can also be used in situations where phrasing a problem as
an approachability question is not necessarily natural. In Section 5.2 we discuss the relation of our results
to approachability in greater detail.

Our contributions can be broken down into three parts. 

• The first contribution lies in the formulation of the online learning problem, with a performance 
measure (a form of regret), defined in terms of certain payoff transformation mappings. While this 
formulation might appear unusual, we show that it is general enough to encompass many seemingly 
different frameworks (games), yet specific enough that we can provide generic upper bounds. 

• The second contribution is in developing upper and lower bounds on the value of the game under 
various natural assumptions. These tools allow us to deal with performance measures well beyond the 
standard notion of external regret. Such performance measures include smooth non-additive functions 
of payoffs, generalizing the "cumulative payoff" notion often considered in the literature. The abstract 
definition in terms of payoff transformations lets us consider rich classes of mappings whose complexity 
can be studied through random averages, covering numbers, and combinatorial parameters. 

• We apply our machinery to a number of well-known problems. (a) First, for the usual notion of
external regret, the results boil down to those of [25]. (b) For the more general $\Phi$-regret (see e.g.
[26, 15, 16]), we recover and improve several known results. In particular, for convergence to
$\Phi$-correlated equilibria, we improve upon the results of Stoltz and Lugosi [26]. (c) We study the game of
Blackwell's approachability [4] in (possibly infinite-dimensional) separable Banach spaces. Specifically,
we show that martingale convergence in these spaces (along with Blackwell's one-shot approachability
condition) is both necessary and sufficient for Blackwell's approachability to hold. (d) We also consider
the game of calibrated forecasting. We improve upon the results of Mannor and Stoltz [22] and prove (to
the best of our knowledge) the first known $O(T^{-1/2})$ rates for calibration with more than 2 outcomes.
Our approach is markedly different from those found in the literature. (e) We use our framework to
study games with global cost functions and, as an example, we extend the bounds recently obtained
by Even-Dar et al. [10]. (f) We provide techniques for bounding notions of regret where the algorithm's
performance is measured against a time-varying comparator (see e.g. [T51 G3 HZ]). Such notions of
regret are better suited for reactive environments. Using the general tools we developed, we not only
recover the results in [18, 6] but also extend them to prove learnability and obtain rates for much more
general settings. Our last example shows that the adaptive regret notion of Hazan and Seshadhri [17] can
be defined in greater generality while still preserving learnability.

The intent of this paper is to provide a framework and tools for studying problems that can be phrased as 
repeated games. However, unlike much of existing research in online learning, we are not solving the general 
problem by exhibiting an algorithm and studying its performance. Rather, we proceed by directly attacking 
the value of the game. Alas, the value is a complicated object, and the non-invitingly long sequence of 
infima and suprema can single-handedly extinguish any desire to study it. Our results attest to the power 
of symmetrization, which emerges as a key tool for studying the value of the game. In the literature, 
symmetrization has been used for i.i.d. data [13]. In [251 IT], it was shown that symmetrization can also
be used in situations beyond the traditional setting. What is even more surprising, we are able to employ 
symmetrization ideas even when the objective function is not a summation of terms but rather a global 
function of many variables. We hope that these tools can have an impact not only on online learning but 
also on game theory. 
We believe that there are many more examples falling under the present framework. We only chose a few
to demonstrate how upper and lower bounds arise from the complexity of the problem. Along with an 
upper bound, a (computationally inefficient) algorithm can always be recovered from the minimax analysis. 
Finding efficient algorithms is often a difficult enterprise, and it is important to be able to understand the 
inherent complexity even before focusing on computation. 

Let us spend a minute describing the organization of this paper. Since our results are meant to serve as 
a unifying framework, we faced the question of whether to build up the level of generality as we progress 
through the paper, or whether to start with the most general results and then make them more specific. 
We decided to do the latter. While we find this flow of general-to-specific more natural, we risk losing 
potential readers on the first few pages. In hopes of avoiding this, after defining the online learning problem
in full generality in Section 2, we briefly state how various well-known frameworks appear as particular
instances. Then, in Section 3, learnability is established under various very general assumptions. Next, in
Section 4, techniques for proving lower bounds are shown. Various examples and frameworks are considered
in more detail in Section 5. In Section 6, the "in-probability" analogues are derived. Hannan consistency is
established via almost sure convergence. For an overview of the results without the painful details, one may
read Section 2 and then skip to Section 5. For the sake of readability, most of the proofs are deferred to the
appendix. Let us remark that [25] is not required for reading this paper. In a few places, however, if a proof
is basically the same as in [25] except for notation, we will omit the proof.



2 The Setting 

At a very abstract level, the problem of online learning can be phrased as that of optimization of a given
function $R_T(f_1, x_1, \ldots, f_T, x_T)$ with coordinates being chosen sequentially by the player and the adversary.
Of course, at this level of generality not much can be said. Hence, we make some minimal assumptions on the
function $R_T$ which lead to meaningful guarantees on the online optimization process.¹ These assumptions
are satisfied by a number of natural performance measures, as illustrated by the examples below.

Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves of the learner (player) and the adversary, respectively. Generalizing the
Online Learning Model considered in [25], we study the following $T$-round interaction between the learner
and the adversary:

On round $t = 1, \ldots, T$,

• the learner chooses a mixed strategy $q_t$ (a distribution on $\mathcal{F}$)

• the adversary picks $x_t \in \mathcal{X}$

• the learner draws $f_t \in \mathcal{F}$ from $q_t$ and receives payoff (loss) signal $\ell(f_t, x_t) \in \mathcal{H}$
End

We would like to specify that we are in the full information setting and that at the end of each round both
the player and the adversary observe each other's moves $f_t, x_t$. The payoff space $\mathcal{H}$ is a (not necessarily
convex) subset of a separable Banach space $\mathcal{B}$. Both the player and the adversary can be randomized and
adaptive.

The goal of the learner is to minimize the following general form of performance measure:

$$R_T = B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \inf_{\phi \in \Phi_T} B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)), \tag{1}$$

where 

¹ The question of general conditions on the function under which such sequential minimization is possible was put forth by
Peter Bartlett a few years ago in a coffee conversation. This paper paves the way towards addressing this question.
• The function $\ell : \mathcal{F} \times \mathcal{X} \to \mathcal{H}$ is an $\mathcal{H}$-valued payoff (or loss) function.

• The function $B : \mathcal{H}^T \to \mathbb{R}$ is a (not necessarily additive or convex) form of cumulative payoff.

• The set $\Phi_T$ consists of sequences $\phi = (\phi_1, \ldots, \phi_T)$ of measurable payoff transformation mappings
$\phi_t : \mathcal{H}^{\mathcal{F} \times \mathcal{X}} \to \mathcal{H}^{\mathcal{F} \times \mathcal{X}}$ that transform the payoff function $\ell$ into a payoff function $\ell_{\phi_t}$.

The goal of the adversary is to maximize the same quantity (1), making it a zero-sum game.

This paper is concerned with learnability and with identifying complexity measures that govern learnability.
But the complexity of what should we focus on? After all, the general online learning problem is defined by the
choice of five components: $B$, $\ell$, $\mathcal{F}$, $\mathcal{X}$, and $\Phi_T$. In [25], the choice was easy: it should be the complexity of the
function class $\mathcal{F}$ that plays the key role. That was natural because the payoff was written as $\ell(f, x) = f(x)$,
which suggested that the function class $\mathcal{F}$ is the object of study. The present formulation, however, is much
more general. When this work commenced, it seemed likely that the complexity of the problem would be some
interaction between the complexity of $\Phi_T$ and the complexity of $\mathcal{F}$. As we show below, one may just focus on
the complexity of $\Phi_T$, while $\mathcal{F}$ and $\mathcal{X}$ are now on the same footing. For instance, even if it might seem
unusual at first, we will introduce a notion of a cover of the set of sequences of payoff transformations $\Phi_T$.
In summary, while all five components $B$, $\ell$, $\mathcal{F}$, $\mathcal{X}$, and $\Phi_T$ play a role in determining learnability, we will
mainly refer to the complexity of the payoff mapping $\ell$ and the payoff transformations $\Phi_T$ without an explicit
reference to $\mathcal{F}$, $\mathcal{X}$, and $B$. We emphasize that most flexibility comes from the payoff mapping $\ell$ and from
the transformations $\Phi_T$ of the payoffs.

In particular, important classes of payoff transformation mappings are the departure mappings that transform
the payoff function $\ell$ by acting only on the first argument of $\ell$, i.e. only modifying the row (player's action)
choice.

Definition 1. A class of sequences of payoff transformations $\Phi_T$ is said to be a departure mapping class if
there exists a class $\Phi'_T$ of sequences $\phi' = (\phi'_1, \ldots, \phi'_T)$ with $\phi'_t : \mathcal{F} \to \mathcal{F}$ such that for each $\phi \in \Phi_T$ there
exists a $\phi' \in \Phi'_T$ with the property that, for all $t \in [T]$, $f \in \mathcal{F}$ and $x \in \mathcal{X}$, the payoff transformations can be
written as $\ell_{\phi_t}(f, x) := \ell(\phi'_t(f), x)$.

For payoff transformation classes that are departure mapping classes, the transformations $\Phi_T$ can be
identified in terms of a corresponding class of departure mappings from $\mathcal{F}$ to itself, and we shall abuse notation
and use $\Phi_T$ to represent both the class of payoff transformations and the class of departure mappings from
$\mathcal{F}$ to itself. Another class of interest consists of payoff transformations that do not vary with time.

Definition 2. We say that $\Phi_T$ is time-invariant if all sequences of payoff transformations are constant in
time: $\Phi_T = \{(\phi, \ldots, \phi) : \phi \in \Phi\}$, where $\Phi$ is a "basis" class of mappings $\mathcal{H}^{\mathcal{F} \times \mathcal{X}} \to \mathcal{H}^{\mathcal{F} \times \mathcal{X}}$.

In the following, we assume that $\mathcal{F}$ and $\mathcal{X}$ are subsets of a separable metric space. Let $\mathcal{Q}$ and $\mathcal{P}$ be the sets
of probability distributions on $\mathcal{F}$ and $\mathcal{X}$, respectively. Assume that $\mathcal{Q}$ and $\mathcal{P}$ are weakly compact. From
the outset, we assume that the adversary is non-oblivious (that is, adaptive). Formally, define a learner's
strategy $\pi$ as a sequence of mappings $\pi_t : (\mathcal{P} \times \mathcal{F} \times \mathcal{X})^{t-1} \to \mathcal{Q}$ for each $t \in [T]$. The form (1) of the
performance measure gives rise to the value of the game:

$$V_T(\ell, \Phi_T) = \inf_{q_1} \sup_{x_1} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T \sim q_T} \sup_{\phi \in \Phi_T} \Big\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \Big\} \tag{2}$$

where $q_t$ and $x_t$ range over $\mathcal{Q}$ and $\mathcal{X}$, respectively. With this definition of a value, the (deterministic) strategy
of the adversary is a sequence of mappings $(\mathcal{Q} \times \mathcal{F} \times \mathcal{X})^{t-1} \times \mathcal{Q} \to \mathcal{X}$ for each $t \in [T]$.

Definition 3. The problem is said to be online learnable if

$$\limsup_{T \to \infty} V_T(\ell, \Phi_T) = 0.$$
The value of the game is defined as an expected performance measure. As such, it yields "in probability"
statements. We define the value of the game using a high probability performance measure in Section 6. We
also discuss there how the high probability results lead to "almost sure" convergence.

2.1 Examples 

A reader might wonder why we have defined the game in terms of abstract payoff transformation mappings. 
It turns out that with this definition, various seemingly different frameworks become nothing but special 
cases, as illustrated by the following examples. 

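Example 1 (External Regret Game). Let $\mathcal{H} = \mathbb{R}$ and

• $B(z_1, \ldots, z_T) = \frac{1}{T} \sum_{t=1}^T z_t$

• $\Phi_T = \{(\phi_f, \ldots, \phi_f) : f \in \mathcal{F} \text{ and } \phi_f : \mathcal{F} \to \mathcal{F} \text{ is a constant mapping } \phi_f(g) = f \ \forall g \in \mathcal{F}\}$

It is easy to see that Eq. (1) becomes

$$R_T = \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T \ell(f, x_t).$$

External regret is discussed in Section 5.1.1.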
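Example 2 ($\Phi$-Regret). Let $\mathcal{H} = \mathbb{R}$ and

• $B(z_1, \ldots, z_T) = \frac{1}{T} \sum_{t=1}^T z_t$

• $\Phi_T = \{(\phi, \ldots, \phi) : \phi \in \Phi\}$ for some fixed family $\Phi$ of $\mathcal{F} \to \mathcal{F}$ mappings.

It is easy to see that Eq. (1) becomes

$$R_T = \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \inf_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^T \ell(\phi(f_t), x_t).$$

This example covers a variety of notions such as external, internal, and swap regrets (see Section 5.1).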
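For a finite action set, the comparator in Example 2 can be evaluated by direct enumeration. The Python sketch below is only an illustration under assumptions not made in the text: actions are indexed $0, \ldots, n-1$, losses are given as a table, and the helper names (`phi_regret`, `constant_maps`, `internal_maps`, `swap_maps`) are hypothetical. Constant maps recover the external regret of Example 1, maps that redirect a single action give internal regret, and the set of all maps gives swap regret.

```python
from itertools import product

def phi_regret(losses, plays, mappings):
    """Average loss of the learner minus the best average loss after a departure map.

    losses[t][f] is the loss of action f on round t, plays[t] is the action drawn
    on round t, and each mapping is a tuple phi with phi[f] the replacement action.
    """
    T = len(plays)
    learner = sum(losses[t][plays[t]] for t in range(T)) / T
    comparator = min(
        sum(losses[t][phi[plays[t]]] for t in range(T)) / T for phi in mappings
    )
    return learner - comparator

def constant_maps(n):
    # phi_f(g) = f for every g: this is the external regret comparator class
    return [tuple([f] * n) for f in range(n)]

def internal_maps(n):
    # the identity, plus maps that replace one action i by another action j
    maps = [tuple(range(n))]
    for i in range(n):
        for j in range(n):
            if i != j:
                maps.append(tuple(j if g == i else g for g in range(n)))
    return maps

def swap_maps(n):
    # every mapping from the action set to itself
    return list(product(range(n), repeat=n))

# Hypothetical loss table: 4 rounds, 3 actions
losses = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 1.0, 0.5]]
plays = [1, 0, 2, 0]
external = phi_regret(losses, plays, constant_maps(3))
internal = phi_regret(losses, plays, internal_maps(3))
swap = phi_regret(losses, plays, swap_maps(3))
```

Since the swap class contains both the constant maps and the single-action redirections, `swap` is never smaller than `external` or `internal` on any input.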



Example 3 (Blackwell's Approachability). Let $\mathcal{H}$ be a subset of a Banach space $\mathcal{B}$, $S \subset \mathcal{B}$ be a closed convex
set, and

• $B(z_1, \ldots, z_T) = \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T z_t - c \right\|$

• $\Phi_T$ contains sequences $(\phi_1, \ldots, \phi_T)$ such that $\ell_{\phi_t}(f, x) = c_t \in S$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, and $1 \le t \le T$.

It is easy to see that Eq. (1) becomes

$$R_T = \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - c \right\|,$$

the distance to the set $S$. Indeed, our definition of $\Phi_T$ ensures that the comparator term is zero. Blackwell's
approachability is discussed in Section 5.2.
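As a concrete instance of the performance measure in Example 3, the sketch below computes the distance of the average payoff to a target set. The choices here are assumptions made only for illustration: payoffs live in $\mathbb{R}^d$ with the Euclidean norm, $S$ is the nonpositive orthant, and the function name is hypothetical. For this $S$ the closest point is obtained by clipping positive coordinates to zero.

```python
import math

def distance_to_nonpositive_orthant(payoffs):
    """Euclidean distance from the average payoff vector to S = {c : c_i <= 0 for all i}."""
    T = len(payoffs)
    d = len(payoffs[0])
    average = [sum(z[i] for z in payoffs) / T for i in range(d)]
    # Projection onto S clips positive coordinates to 0, so only they contribute.
    return math.sqrt(sum(max(a, 0.0) ** 2 for a in average))

# Hypothetical payoff sequence in R^2
payoffs = [(1.0, -1.0), (0.5, -2.0), (0.0, 1.0)]
distance = distance_to_nonpositive_orthant(payoffs)
```

The average payoff here is $(0.5, -2/3)$, so only the first coordinate lies outside $S$.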
Example 4 (Calibration of Forecasters). Let $\mathcal{H} = \mathbb{R}^k$, $\mathcal{F} = \Delta(k)$ (the $k$-dimensional probability simplex)
and $\mathcal{X}$ the set of standard unit vectors in $\mathbb{R}^k$ (vertices of $\Delta(k)$). Define $\ell(f, x) = 0$. Further,

• $B(z_1, \ldots, z_T) = -\left\| \frac{1}{T} \sum_{t=1}^T z_t \right\|$ for some norm $\|\cdot\|$ on $\mathbb{R}^k$

• $\Phi_T = \{(\phi_{p,\lambda}, \ldots, \phi_{p,\lambda}) : p \in \Delta(k), \lambda > 0\}$ contains time-invariant mappings defined by

$$\ell_{\phi_{p,\lambda}}(f, x) = \mathbf{1}\{\|f - p\| \le \lambda\} \cdot (f - x).$$

It is easy to see that Eq. (1) becomes

$$R_T = \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|f_t - p\| \le \lambda\} \cdot (f_t - x_t) \right\|.$$

Calibration is discussed in more detail in Section 5.3.
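The calibration measure of Example 4 takes a supremum over all centers $p$ and radii $\lambda$. A finite grid gives a lower bound that is easy to compute. The sketch below is an illustration under assumptions of our own: $k = 2$ outcomes, the $\ell_1$ norm, a uniform grid over $p$ and $\lambda$, and a hypothetical function name.

```python
def calibration_score(forecasts, outcomes, grid_size=10):
    """Grid approximation (a lower bound) of the calibration measure for k = 2.

    forecasts[t] is a point of the simplex, outcomes[t] is a unit vector,
    and both the indicator and the final norm use the l1 norm.
    """
    T = len(forecasts)
    centers = [(i / grid_size, 1.0 - i / grid_size) for i in range(grid_size + 1)]
    # l1 distances between simplex points are at most 2
    radii = [j / grid_size for j in range(1, 2 * grid_size + 1)]
    best = 0.0
    for p in centers:
        for lam in radii:
            total = [0.0, 0.0]
            for f, x in zip(forecasts, outcomes):
                if abs(f[0] - p[0]) + abs(f[1] - p[1]) <= lam:
                    total[0] += f[0] - x[0]
                    total[1] += f[1] - x[1]
            best = max(best, (abs(total[0]) + abs(total[1])) / T)
    return best

# A forecaster that always says (0.5, 0.5) while outcome 1 always occurs ...
miscalibrated = calibration_score([(0.5, 0.5)] * 4, [(1.0, 0.0)] * 4)
# ... versus the same forecasts when the two outcomes alternate.
calibrated = calibration_score([(0.5, 0.5)] * 4, [(1.0, 0.0), (0.0, 1.0)] * 2)
```

In the first sequence the forecast is off by $(-0.5, 0.5)$ on every round, while in the second the errors cancel exactly.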

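Example 5 (Global Cost Online Learning Game [10]). Let $\mathcal{H} = \mathbb{R}^k$, $\mathcal{X} = [0, 1]^k$, $\mathcal{F} = \Delta(k)$, and
$\ell(f, x) = f \odot x = (f^1 \cdot x^1, \ldots, f^k \cdot x^k)$.

• $B(z_1, \ldots, z_T) = \left\| \frac{1}{T} \sum_{t=1}^T z_t \right\|$

• $\Phi_T = \{(\phi_f, \ldots, \phi_f) : f \in \mathcal{F} \text{ and } \phi_f : \mathcal{F} \to \mathcal{F} \text{ is a constant mapping } \phi_f(g) = f \ \forall g \in \mathcal{F}\}$

It is easy to see that Eq. (1) becomes

$$R_T = \left\| \frac{1}{T} \sum_{t=1}^T f_t \odot x_t \right\| - \inf_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^T f \odot x_t \right\|.$$

A generalization of this scenario is considered in Section 5.4.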



2.2 Notation 

Let $\mathbb{E}_{x \sim p}$ denote expectation with respect to a random variable $x$ with a distribution $p$. Note that we
do not use capital letters for random variables in order to ease reading of already cumbersome equations.
For a collection of random variables $x_1, \ldots, x_T$ with distributions $p_1, \ldots, p_T$, we will use the shorthand
$\mathbb{E}_{x_{1:T} \sim p_{1:T}}$ to denote expectation with respect to all these variables. Let $q$ and $p$ be distributions on $\mathcal{F}$ and
$\mathcal{X}$, respectively. We define the shorthand $\ell(q, p) = \mathbb{E}_{f \sim q, x \sim p}\, \ell(f, x)$ and $\ell_\phi(q, p) = \mathbb{E}_{f \sim q, x \sim p}\, \ell_\phi(f, x)$. The Dirac
delta distribution is denoted by $\delta_x$. A Rademacher random variable is uniformly distributed on $\{\pm 1\}$.
The notation $x_{a:b}$ denotes the sequence $x_a, \ldots, x_b$. The indicator of an event $A$ is denoted by $\mathbf{1}\{A\}$. The set
$\{1, \ldots, T\}$ is denoted by $[T]$, while the $k$-dimensional probability simplex is denoted by $\Delta(k)$. The set of all
functions from $\mathcal{X}$ to $\mathcal{Y}$ is denoted by $\mathcal{Y}^{\mathcal{X}}$, and the $t$-fold product $\mathcal{X} \times \ldots \times \mathcal{X}$ is denoted by $\mathcal{X}^t$. Whenever
a supremum (infimum) is written in the form $\sup_a$ without $a$ being quantified, it is assumed that $a$ ranges
over the set of all possible values which will be understood from the context. Convex hulls will be denoted
by $\mathrm{conv}(\cdot)$.



Following [25], we define binary trees as follows.

Definition 4. Given some set $\mathcal{Z}$, a $\mathcal{Z}$-valued tree of depth $T$ is a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_T)$ of $T$ mappings
$\mathbf{z}_i : \{\pm 1\}^{i-1} \to \mathcal{Z}$. The root of the tree $\mathbf{z}$ is the constant function $\mathbf{z}_1 \in \mathcal{Z}$.

Unless specified otherwise, $\epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T$ will define a path. Slightly abusing the notation, we
will write $\mathbf{z}_t(\epsilon)$ instead of $\mathbf{z}_t(\epsilon_{1:t-1})$.

Let $\phi_{\mathrm{id}}$ denote the identity payoff transformation $\ell_{\phi_{\mathrm{id}}}(f, x) = \ell(f, x)$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$. Let $\mathcal{I} =
\{(\phi_{\mathrm{id}}, \ldots, \phi_{\mathrm{id}})\}$ be the singleton set containing the time-invariant sequence of identity transformations.

For a separable Banach space $\mathcal{B}$ equipped with a norm $\|\cdot\|$, let $B_{\|\cdot\|}$ be the unit ball. Let $\mathcal{B}^*$ denote the
dual space and $B_{\|\cdot\|_*}$ the corresponding dual ball. For $a \in \mathcal{B}^*$, $\|a\|_* = \sup_{b \in B_{\|\cdot\|}} |\langle a, b \rangle|$. For $b \in \mathcal{B}$, we write
$\langle a, b \rangle = a(b)$ for the continuous linear functional $a \in \mathcal{B}^*$ on $\mathcal{B}$. A Hilbert space is dual to itself.

3 General Upper Bounds 

This section is devoted to upper bounds on the value of the game. We start by introducing the Triplex
Inequality, which requires no assumptions beyond those described in Section 2. Under the additional weak
assumption of subadditivity of $B$, we can perform symmetrization and further upper bound two of the three
terms in the Triplex Inequality by a non-additive version of sequential Rademacher complexity [25]. As we
progress through the section, we make additional assumptions and specialize and refine the upper bounds.

The following definition generalizes the notion of sequential Rademacher complexity, introduced in [25], to
"global" functions $B$ of the payoff sequence.

Definition 5. The sequential complexity with respect to the payoff function $\ell$ and payoff transformation
mappings $\Phi_T$ is defined as

$$\mathfrak{R}_T(\ell, \Phi_T, B) = \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi_T} B\big(\epsilon_1 \ell_{\phi_1}(\mathbf{f}_1(\epsilon), \mathbf{x}_1(\epsilon)), \ldots, \epsilon_T \ell_{\phi_T}(\mathbf{f}_T(\epsilon), \mathbf{x}_T(\epsilon))\big)$$

where the outer supremum is taken over all $(\mathcal{F} \times \mathcal{X})$-valued trees of depth $T$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence
of i.i.d. Rademacher random variables.

Whenever $B$ is clear from the context, it will be omitted from the notation: $\mathfrak{R}_T(\ell, \Phi_T)$. If $\Phi_T$ is a set of
sequences of time-invariant transformations obtained from the base class $\Phi$, we will simply write $\mathfrak{R}_T(\ell, \Phi)$.

Let us remark that the moves of the player and the adversary appear "on the same footing" both in the
definition of the game and in the above definition of sequential complexity. The "asymmetry" of sequential
Rademacher complexity [25] (where the supremum is taken over the player's best choice) arises precisely from
the asymmetry of the notion of external regret, which, in turn, is due to $\Phi_T$ acting on the player's choice only.
In Section 5.1.1 we show that the notion studied in [25] is indeed recovered for the case of external regret.

An equivalent way to write sequential complexity is through the expanded version

$$\mathfrak{R}_T(\ell, \Phi_T, B) = \sup_{f_1, x_1} \mathbb{E}_{\epsilon_1} \sup_{f_2, x_2} \mathbb{E}_{\epsilon_2} \cdots \sup_{f_T, x_T} \mathbb{E}_{\epsilon_T} \sup_{\phi \in \Phi_T} B\big(\epsilon_1 \ell_{\phi_1}(f_1, x_1), \ldots, \epsilon_T \ell_{\phi_T}(f_T, x_T)\big) \tag{3}$$

where the supremum on the $t$-th step is over $f_t \in \mathcal{F}$, $x_t \in \mathcal{X}$. We shall use Eq. (3) and the more succinct
Definition 5 interchangeably.
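The expanded form of sequential complexity, with a supremum over $(f_t, x_t)$ alternating with an average over the sign $\epsilon_t$, can be evaluated by brute force when $\mathcal{F}$, $\mathcal{X}$ and a time-invariant class of transformations are finite and $T$ is tiny. The Python sketch below is only meant to make the interleaving concrete. The function names, the 0/1 loss, and the choice of $B$ as the average are assumptions made for illustration, and the running time is exponential in $T$.

```python
def sequential_complexity(T, actions, outcomes, transformed_payoffs, B):
    """Brute-force evaluation of the interleaved sup / expectation form.

    transformed_payoffs is a finite list of functions (f, x) -> real, one per
    time-invariant transformation; B maps a list of T reals to a real.
    """
    def recurse(history):
        if len(history) == T:
            # innermost supremum over the transformation class
            return max(
                B([eps * payoff(f, x) for eps, f, x in history])
                for payoff in transformed_payoffs
            )
        # supremum over (f_t, x_t), then average over the sign eps_t
        return max(
            0.5 * (recurse(history + [(1, f, x)]) + recurse(history + [(-1, f, x)]))
            for f in actions
            for x in outcomes
        )
    return recurse([])

def average(values):
    return sum(values) / len(values)

def loss(f, x):
    # hypothetical 0/1 loss on two actions and two outcomes
    return 1.0 if f != x else 0.0

actions, outcomes = [0, 1], [0, 1]
# Constant departure maps (external regret): the transformed payoff ignores f.
constant_class = [lambda f, x, g=g: loss(g, x) for g in actions]
# Singleton class containing only the identity transformation.
identity_class = [loss]

complexity_external_T1 = sequential_complexity(1, actions, outcomes, constant_class, average)
complexity_identity_T3 = sequential_complexity(3, actions, outcomes, identity_class, average)
```

With the identity class and $B$ the average, each sign $\epsilon_t$ multiplies a quantity chosen before $\epsilon_t$ is drawn, so the value is zero. With the two constant maps and $T = 1$, the inner maximum picks whichever constant action makes $\epsilon_1 \ell$ largest, which gives $1/2$.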

3.1 Triplex Inequality 

The following theorem is the main starting point for all further analysis. Because of its importance, we shall 
refer to it as the Triplex Inequality. The three terms in the upper bound of the theorem can be thought of as 
the three key players in the process of online learning: martingale convergence, the ability to perform well 
if the future is known, and complexity of the class in terms of sequential complexity. 
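Theorem 1 (Triplex Inequality). The following 3-term upper bound on the value of the game holds:

$$\begin{aligned}
V_T(\ell, \Phi_T) \le{} & \sup_{p_1, q_1} \mathbb{E}_{x_1 \sim p_1, f_1 \sim q_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{x_T \sim p_T, f_T \sim q_T} \Big\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \mathbb{E}_{x'_{1:T} \sim p_{1:T},\, f'_{1:T} \sim q_{1:T}} B(\ell(f'_1, x'_1), \ldots, \ell(f'_T, x'_T)) \Big\} \\
& + \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{\phi \in \Phi_T} \mathbb{E}_{x_{1:T} \sim p_{1:T},\, f_{1:T} \sim q_{1:T}} \Big\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \Big\} \\
& + \sup_{p_1, q_1} \mathbb{E}_{x_1 \sim p_1, f_1 \sim q_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{x_T \sim p_T, f_T \sim q_T} \sup_{\phi \in \Phi_T} \Big\{ \mathbb{E}_{x'_{1:T} \sim p_{1:T},\, f'_{1:T} \sim q_{1:T}} B(\ell_{\phi_1}(f'_1, x'_1), \ldots, \ell_{\phi_T}(f'_T, x'_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \Big\}
\end{aligned} \tag{4}$$

First, we remark that convexity of $B$ is not required for the Triplex Inequality to hold. Under a weak
subadditivity condition, the following theorem gives upper bounds on the first and the third term.

Theorem 2. If $B$ is subadditive, then the last term in the Triplex Inequality is upper bounded by twice the
sequential complexity, $2\mathfrak{R}_T(\ell, \Phi_T, B)$, and the first term is bounded by $2\mathfrak{R}_T(\ell, \mathcal{I}, B)$, where $\mathcal{I}$ is the singleton
set consisting of the identity mapping. Similarly, if $-B$ is subadditive, then the last term is upper bounded
by $2\mathfrak{R}_T(\ell, \Phi_T, -B)$ and the first term is bounded by $2\mathfrak{R}_T(\ell, \mathcal{I}, -B)$.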



Discussion of Theorem 1 and Theorem 2

• First, let us mention that the Triplex Inequality is not the only way to decompose the value of the game
into useful and interpretable terms. In fact, slightly different decompositions yield better constants for
some of the examples in this paper. Nonetheless, the Triplex Inequality seems to capture the essence
of all the problems we considered and allows us to give a unified treatment to all of them.

• We note that the first and the third terms are similar in their form. In fact, the first term can be
equivalently written as

$$\sup_{p_1, q_1} \mathbb{E}_{x_1 \sim p_1, f_1 \sim q_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{x_T \sim p_T, f_T \sim q_T} \sup_{\phi \in \mathcal{I}} \Big\{ B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) - \mathbb{E}_{x'_{1:T} \sim p_{1:T},\, f'_{1:T} \sim q_{1:T}} B(\ell_{\phi_1}(f'_1, x'_1), \ldots, \ell_{\phi_T}(f'_T, x'_T)) \Big\}$$

where $\mathcal{I}$ only contains the identity mapping. If $\mathcal{I} \subseteq \Phi_T$, then, trivially, $\mathfrak{R}_T(\ell, \mathcal{I}, B) \le \mathfrak{R}_T(\ell, \Phi_T, B)$
and, therefore, an upper bound on the third term yields an upper bound on the first. However, in
some situations $\Phi_T$ is "simpler" or incomparable to $\mathcal{I}$ and, hence, the first and the third term in the
Triplex Inequality are distinct.

• What exactly is achieved by Theorem 2? Let us compare the third term in the Triplex Inequality to its
sequential complexity upper bound given by Eq. (3). Both quantities involve interleaved suprema and
expected values. However, in the former, the suprema are over the choice of distributions $p_t, q_t$ and
the expected values are over draws of $x_t, f_t$ from these mixed strategies. In contrast, sequential complexity,
as written in Eq. (3), contains suprema over the choices $x_t, f_t$ followed by a random draw of the next
sign $\epsilon_t$. Crucially, it is easier to work with the sequential complexity as opposed to the third term
in the Triplex Inequality since in the former the only randomness comes from the random signs. In
mathematical terms, the $\sigma$-algebra is generated by $\{\epsilon_t\}$ rather than a complicated stochastic process
arising from the Triplex Inequality. This is one of the key observations of the paper.

• Depending on a particular problem, some of the terms in the Triplex Inequality might be easier to
control than others. However, it is often the case that the first term is the easiest, as it naturally
leads to the question of martingale convergence. The second term is typically bounded by providing a
specific response strategy for the player if the mixed strategy of the adversary is known. This response
strategy is similar to the so-called Blackwell's condition for approachability (see Section 5.2 for further
comparison). The third term is arguably the most difficult as it captures complexity of the set of payoff
transformations $\Phi_T$. Under the subadditivity assumption on $B$, Theorem 2 upper bounds the first and
third terms by the sequential complexity.

• We remark that the first and third terms in the Triplex Inequality contain suprema over the player's
strategies $q_t$ instead of infima as in the definition of the value of the game. The proof of Theorem 1
points out the step where this over-bounding is done. While this might appear as a loose step, in
all the examples we considered, this still yields the needed results. Nevertheless, as mentioned in the
proof, one can substitute a particular strategy $q_t^*$ for the first and third terms instead of passing to the
supremum. For instance, $q_t^*$ can be the strategy which makes the second term in the Triplex Inequality
small. To simplify the presentation, we decided not to include such analysis.

• The following observation gives us a simple condition under which we can replace $B$ with some other
$B'$, and we shall find it useful in scenarios when it is difficult to directly deal with $B$. If $B : \mathcal{H}^T \to \mathbb{R}$
and $B' : \mathcal{H}^T \to \mathbb{R}$ are such that $\forall z_1, \ldots, z_T \in \mathcal{H}$, $B(z_1, \ldots, z_T) \le B'(z_1, \ldots, z_T)$, then we have that for
any class of transformations $\Phi_T$,

$$\mathfrak{R}_T(\ell, \Phi_T, B) \le \mathfrak{R}_T(\ell, \Phi_T, B'). \tag{5}$$

• Finally, let us mention that we could have defined the performance measure in (1) as

$$R_T = \sup_{(\phi', \phi) \in (\Phi'_T \times \Phi_T)} \Big\{ B(\ell_{\phi'_1}(f_1, x_1), \ldots, \ell_{\phi'_T}(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \Big\}. \tag{6}$$

Clearly, (1) can be expressed as an instance of (6) by setting $\Phi'_T = \mathcal{I}$. Conversely, if $B$ is, for instance,
an average of its coordinates, we can view definition (6) as a particular case of (1). Indeed, given
a payoff $\ell$ and sets $\Phi'_T, \Phi_T$ of transformations, define a new payoff $\tilde{\ell}(f, x) = 0$ and $\tilde{\ell}_{(\phi'_t, \phi_t)}(f, x) =
-(\ell_{\phi'_t}(f, x) - \ell_{\phi_t}(f, x))$. Then (1) becomes exactly (6). While the analysis presented in this paper
can be extended to (6), in the examples we consider, the definition (1) of performance measure is
expressive enough.

We now detail upper bounds on this complexity under the smoothness assumption on B. The smoothness 
assumption covers many important cases, such as norms. 



3.2 General Bounds for Smooth B 

As shown by Pisier [21] and Pinelis [53], the existence of a smooth norm in a Banach space is crucial in the study
of exponential inequalities for martingales. Using similar techniques, we show that a smooth function B 
will admit upper bounds in terms of certain increments. This will yield general tools for studying sequential 
complexity for smooth functions B. Informally, the smoothness assumption provides a link from a "global" 
function of coordinates to a sum of its parts. From the point of view of online learning, this is very promising, 
as it appears to be difficult to sequentially optimize a "global" function of many decisions. 

Consider the following definition of smoothness. 

Definition 6. A function $G : \mathcal{H} \to \mathbb{R}$ is said to be $(\sigma, p)$-uniformly smooth on $\mathcal{H}$ for some $p \in (1, 2]$ and $\sigma > 0$
if, for all $z, z' \in \mathcal{H}$, we have

$$G(z) \le G(z') + \langle \nabla G(z'), z - z' \rangle + \frac{\sigma}{p} \|z - z'\|^p.$$

We say that $G$ is uniformly smooth if there exist finite $\sigma$ and $p$ such that $G$ is $(\sigma, p)$-uniformly smooth. We
say that the space $(\mathcal{B}, \|\cdot\|)$ is $(\gamma, p)$-smooth when the function $\|\cdot\|^p / p$ is $(\gamma, p)$-uniformly smooth.
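As a quick sanity check of Definition 6, consider $G(z) = \|z\|_2^2$ on $\mathbb{R}^d$, assuming the second-order term in the definition is $\frac{\sigma}{p}\|z - z'\|^p$. Since $\|z\|_2^2 = \|z'\|_2^2 + \langle 2z', z - z' \rangle + \|z - z'\|_2^2$, the inequality holds with equality for $p = 2$ and $\sigma = 2$, and fails for smaller $\sigma$. The Python sketch below checks this numerically. The function name and the random test points are hypothetical.

```python
import random

def smoothness_gap(z, z_prime, sigma=2.0, p=2):
    """Right-hand side minus left-hand side of the smoothness inequality
    for G(z) = ||z||_2^2, whose gradient at z' is 2 z'."""
    G = lambda v: sum(c * c for c in v)
    gradient = [2.0 * c for c in z_prime]
    diff = [a - b for a, b in zip(z, z_prime)]
    linear = sum(g * d for g, d in zip(gradient, diff))
    norm_diff = sum(d * d for d in diff) ** 0.5
    return G(z_prime) + linear + (sigma / p) * norm_diff ** p - G(z)

rng = random.Random(0)
gaps = [
    smoothness_gap(
        [rng.uniform(-1, 1) for _ in range(3)],
        [rng.uniform(-1, 1) for _ in range(3)],
    )
    for _ in range(100)
]
# With sigma = 1 the inequality is violated, e.g. at z = e_1 and z' = 0.
violated_gap = smoothness_gap([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], sigma=1.0)
```

A nonnegative gap means the inequality holds at that pair of points. For $\sigma = 2$ the gap is zero up to rounding, which matches the constant $2(q - 1)$ for $q = 2$ that appears for squared norms.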

A function which is smooth in its arguments can be "sequentially linearized", with additional second-order
terms as norms of the increments. We establish the following upper bound on the first term of the Triplex 
Inequality. 

Lemma 3. Suppose $B$ is subadditive and for some $q \ge 1$, $B^q$ is $(\sigma, p)$-uniformly smooth in each of its
arguments. Suppose $B(0, \ldots, 0) = 0$ and that for any $x \in \mathcal{X}$ and $f \in \mathcal{F}$ it is true that $\|\ell(f, x)\| \le \eta$. Then
the first term in the Triplex Inequality is bounded by $\left((2\eta)^p \sigma T / p\right)^{1/q}$.

Under the assumptions of Lemma 3, we can also provide an upper bound on the third term. Lemma 4 below
says that the sequential complexity defined through a smooth function $B$ can be upper bounded by the
sequential complexity involving a sum of first-order expansions of $B$.

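Lemma 4. Assume that for some $q \ge 1$, $B^q$ is $(\sigma, p)$-uniformly smooth in each of its arguments, $B(0, \ldots, 0) = 0$,
and that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f, x)\| \le \eta$. Then we have that

$$\mathfrak{R}_T(\ell, \Phi_T) \le \Big( \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi_T} \sum_{t=1}^T \epsilon_t\, g_t\big(\ell_{\phi_1}(\mathbf{f}_1(\epsilon), \mathbf{x}_1(\epsilon)), \ldots, \ell_{\phi_t}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon))\big) \Big)^{1/q} + (\sigma \eta^p / p)^{1/q}\, T^{1/q}$$

where

$$g_t\big(\ell_{\phi_1}(\mathbf{f}_1(\epsilon), \mathbf{x}_1(\epsilon)), \ldots, \ell_{\phi_t}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon))\big) = \Big\langle \nabla_t B^q\big(\epsilon_1 \ell_{\phi_1}(\mathbf{f}_1(\epsilon), \mathbf{x}_1(\epsilon)), \ldots, \epsilon_{t-1} \ell_{\phi_{t-1}}(\mathbf{f}_{t-1}(\epsilon), \mathbf{x}_{t-1}(\epsilon)), 0, \ldots, 0\big),\ \ell_{\phi_t}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) \Big\rangle.$$

By taking gradients at successive time steps, we reduced the study of a global function $B$ to the study of
its gradients. A reader familiar with [25] will notice that the first term of Lemma 4 (under the power of
$1/q$) resembles sequential Rademacher complexity. The first step in studying this term is to ask what can
be done with a finite class $\Phi_T$. To approach this question, we state a lemma from [25].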

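Lemma 5 ([25]). For any finite set $V$ of $\mathbb{R}$-valued trees of depth $T$ we have that

\[ \mathbb{E}_{\epsilon}\Big[ \max_{\mathbf{v} \in V} \sum_{t=1}^{T} \epsilon_t \mathbf{v}_t(\epsilon) \Big] \le \sqrt{ 2 \log(|V|) \max_{\mathbf{v} \in V} \max_{\epsilon \in \{\pm 1\}^T} \sum_{t=1}^{T} \mathbf{v}_t(\epsilon)^2 } . \]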
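The finite class lemma (Lemma 5) can be checked numerically on small instances. The following is a minimal sketch, not part of the original analysis: it assumes a tree is stored as a dictionary from sign prefixes to real values (the helper names `random_tree`, `path_values` and `lemma5_sides` are hypothetical) and computes both sides of the inequality exactly by enumerating all sign sequences.

```python
import itertools
import math
import random


def random_tree(depth, rng):
    """Real-valued tree of the given depth: maps each sign prefix
    (eps_1, ..., eps_{t-1}) to a value v_t in [-1, 1]."""
    return {
        prefix: rng.uniform(-1.0, 1.0)
        for t in range(depth)
        for prefix in itertools.product((-1, 1), repeat=t)
    }


def path_values(tree, eps):
    """Values v_1(eps), ..., v_T(eps); v_t depends only on eps_{1:t-1}."""
    return [tree[eps[:t]] for t in range(len(eps))]


def lemma5_sides(trees, depth):
    """Return (left-hand side, right-hand side) of the finite class lemma,
    computed exactly by enumerating all 2^depth sign sequences."""
    paths = list(itertools.product((-1, 1), repeat=depth))
    lhs = sum(
        max(sum(e * v for e, v in zip(eps, path_values(tree, eps))) for tree in trees)
        for eps in paths
    ) / len(paths)
    max_sq = max(
        sum(v * v for v in path_values(tree, eps)) for tree in trees for eps in paths
    )
    rhs = math.sqrt(2.0 * math.log(len(trees)) * max_sq)
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(0)
    trees = [random_tree(4, rng) for _ in range(5)]
    print(lemma5_sides(trees, 4))
```

Exhaustive enumeration is only feasible for small depth; the point is to make the objects in the lemma concrete, in particular the fact that the value at step $t$ may depend on the earlier signs.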



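The above lemma can be used to show the following result for any finite set of transformations $\Phi_T$.

Proposition 6. For any finite set of payoff transformations $\Phi_T$, under the conditions of Lemma 4 and assuming

\[ \big\| \nabla_t \mathbf{B}^q\big(\epsilon_1 \ell_{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)), \ldots, \epsilon_{t-1} \ell_{\phi_{t-1}}(\mathbf{f}_{t-1}(\epsilon),\mathbf{x}_{t-1}(\epsilon)), 0, \ldots, 0\big) \big\| \le R , \]

we have

\[ \mathfrak{R}_T(\ell, \Phi_T) \le \big( 2 \eta^2 R^2 \log(|\Phi_T|)\, T \big)^{1/2q} + (\sigma \eta^p / p)^{1/q}\, T^{1/q} . \]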

Hence, if $\Phi_T$ is finite, sequential complexity is bounded whenever $\mathbf{B}$ is smooth and the gradients of $\mathbf{B}$ are bounded by $R$. Typically, $R$ is of the order $O(1/T)$ if $\mathbf{B}$ is appropriately normalized to account for $T$ (for instance, if $\mathbf{B}$ is an average of its coordinates). Similarly, $\sigma$ is either zero or $o(1)$ for the examples considered in this paper. With the appropriate behavior of the online covering number, the bound yields learnability according to Definition 3.
3.3 When B is a Function of the Average



For the rest of this sub-section we consider $\mathbf{B}$ of a particular form. We assume that

\[ \mathbf{B}(z_1, \ldots, z_T) = G\Big( \frac{1}{T} \sum_{t=1}^{T} z_t \Big) , \]

where some power of $G$ is a $(\gamma, p)$-smooth function on the convex set $\mathrm{conv}(\mathcal{H})$ for some $1 < p \le 2$. This form of $\mathbf{B}$ occurs naturally in many games, including Blackwell's approachability and calibration. Among the most basic smooth functions are powers of norms, as the next example shows.



Example 6. Consider $\mathbf{B}$ of the form

\[ \mathbf{B}(z_1, \ldots, z_T) = \Big\| \frac{1}{T} \sum_{t=1}^{T} z_t \Big\|_q . \]

The three cases $q \in (1, \infty)$, $q = 1$, and $q = \infty$ are considered separately. Here $G(z) = \|z\|_q$ and we are interested in checking if $G^s$ is uniformly smooth for some power $s$.

► $q \in (1, \infty)$. For any $q \in (1, 2]$, $G^q(z) = \|z\|_q^q$ is $(q, q)$-uniformly smooth, and for any $q \in [2, \infty)$ the function $G^2(z) = \|z\|_q^2$ is $(2(q-1), 2)$-uniformly smooth.

► $q = \infty$. Unfortunately, for no finite power $s$ is $G^s$ uniformly smooth. However, for any $z \in \mathcal{H}$ and any $q' \in (1, \infty)$, $\|z\|_\infty \le \|z\|_{q'}$. Hence we can use (5) and upper bound the sequential complexity

\[ \mathfrak{R}_T(\ell, \Phi_T, \mathbf{B}) \le \mathfrak{R}_T(\ell, \Phi_T, \mathbf{B}') \]

where $\mathbf{B}'(z_1, \ldots, z_T) = \big\| \frac{1}{T} \sum_{t=1}^{T} z_t \big\|_{q'}$. By choosing $q'$ appropriately and using the smoothness of the $L_{q'}$ norm (previous case) we can provide upper bounds for the value of the game.

► $q = 1$. As in the previous case, for no finite power $s$ is $G^s$ uniformly smooth. However, if $\mathcal{H} \subseteq \mathbb{R}^d$, then for any $z \in \mathcal{H}$ and any $q' \in (1, \infty)$, $\|z\|_1 \le C_{q',d} \|z\|_{q'}$, where $C_{q',d}$ is a constant dependent on $q'$ and on the dimension $d$ of the space. Again we can use (5) and upper bound

\[ \mathfrak{R}_T(\ell, \Phi_T, \mathbf{B}) \le \mathfrak{R}_T(\ell, \Phi_T, \mathbf{B}') \]

where $\mathbf{B}'(z_1, \ldots, z_T) = C_{q',d} \big\| \frac{1}{T} \sum_{t=1}^{T} z_t \big\|_{q'}$. Choosing $q'$ appropriately and using the smoothness of the $L_{q'}$ norm we can provide upper bounds for the value of the game.
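The smoothness claim for $q \in [2, \infty)$ in the example above can be probed numerically. The sketch below is illustrative only and uses hypothetical helper names (`lq_norm`, `grad_sq_norm`, `smoothness_gap`); it evaluates the slack in the inequality $G^2(z) \le G^2(z') + \langle \nabla G^2(z'), z - z' \rangle + (q-1)\|z - z'\|_q^2$, which is the $(2(q-1), 2)$-uniform smoothness condition with $\sigma/p = q - 1$, at a nonzero base point $z'$.

```python
import random


def lq_norm(z, q):
    """The l_q norm of a finite-dimensional vector."""
    return sum(abs(c) ** q for c in z) ** (1.0 / q)


def grad_sq_norm(z, q):
    """Gradient of z -> ||z||_q^2 at a nonzero point z."""
    norm = lq_norm(z, q)
    return [
        2.0 * norm ** (2.0 - q) * abs(c) ** (q - 1.0) * (1.0 if c >= 0 else -1.0)
        for c in z
    ]


def smoothness_gap(z, z_prime, q):
    """Right-hand side minus left-hand side of the (2(q-1), 2)-uniform
    smoothness inequality for G^2(z) = ||z||_q^2; nonnegative when it holds."""
    grad = grad_sq_norm(z_prime, q)
    diff = [a - b for a, b in zip(z, z_prime)]
    linear = sum(g * d for g, d in zip(grad, diff))
    sigma, p = 2.0 * (q - 1.0), 2.0
    rhs = lq_norm(z_prime, q) ** 2 + linear + (sigma / p) * lq_norm(diff, q) ** p
    return rhs - lq_norm(z, q) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    z = [rng.uniform(-1, 1) for _ in range(3)]
    z_prime = [rng.uniform(-1, 1) for _ in range(3)]
    print(smoothness_gap(z, z_prime, 3.0))
```

For $q = 2$ the gap is zero up to rounding, since the inequality is then the exact expansion of the squared Euclidean norm.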



For a concrete example of a smooth norm, we refer to the calibration example of Section 5.3. We now specialize the statement of Proposition 6 to the specific assumption on $\mathbf{B}$.

Corollary 7. Let $\Phi_T$ be a finite set of payoff transformations. Assume that for some $q \ge 1$, $G^q$ is a $(\gamma, p)$-smooth function for some $1 < p \le 2$. Also assume that $\|\nabla G^q(z)\|_* \le 1$ for any $z \in \mathrm{conv}(\mathcal{H})$. Further, suppose that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f,x)\| \le \eta$. Then it holds that

\[ \mathfrak{R}_T(\ell, \Phi_T) \le \Big( \frac{2 \eta^2 \log(|\Phi_T|)}{T} \Big)^{1/2q} + (\gamma \eta^p / p)^{1/q}\, T^{(1-p)/q} . \]
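The rates in Corollary 7 can be traced back to Proposition 6 by a short calculation. The sketch below assumes that $\mathbf{B}(z_1, \ldots, z_T) = G(\frac{1}{T}\sum_t z_t)$ and that the dual norm of $\nabla G^q$ is at most 1, which is how we read the gradient assumption of the corollary; the identification of the constants $R$ and $\sigma$ is our reading of the argument rather than a quotation of the proof.

```latex
% Gradient of B^q with respect to the t-th argument (chain rule):
\nabla_t \mathbf{B}^q(z_1,\ldots,z_T)
  = \frac{1}{T}\,\nabla G^q\Big(\frac{1}{T}\sum_{s=1}^{T} z_s\Big),
\qquad\text{so } R = \frac{1}{T}.

% Smoothness in one argument: perturbing z_t by \delta moves the average by \delta/T,
\frac{\gamma}{p}\,\Big\|\frac{\delta}{T}\Big\|^{p}
  = \frac{\gamma T^{-p}}{p}\,\|\delta\|^{p},
\qquad\text{so } \sigma = \gamma T^{-p}.

% Substituting R and \sigma into the two terms of Proposition 6:
\big(2\eta^2 R^2 \log|\Phi_T|\, T\big)^{1/2q}
  = \Big(\frac{2\eta^2\log|\Phi_T|}{T}\Big)^{1/2q},
\qquad
\Big(\frac{\sigma\eta^p}{p}\Big)^{1/q} T^{1/q}
  = \Big(\frac{\gamma\eta^p}{p}\Big)^{1/q} T^{(1-p)/q}.
```

Both terms vanish as $T$ grows whenever $p > 1$, which is the regime assumed in the corollary.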



Corollary 7 is a direct consequence of the more general Proposition 6 in the case where $\mathbf{B}$ is a function of the average. It turns out that we do not always get the best convergence rate in this manner. The following result shows that if $G$ is 1-Lipschitz and $G^2$ is 2-smooth, we obtain an $O(1/\sqrt{T})$ convergence rate.
Lemma 8. Let $\Phi_T$ be a finite set of payoff transformations. Assume that $\mathbf{B}(z_1, \ldots, z_T) = G\big( \frac{1}{T} \sum_{t=1}^{T} z_t \big)$, where $G \ge 0$ is 1-Lipschitz with respect to a norm $\|\cdot\|$, $G(0) = 0$ and $G^2$ is a $(\gamma, 2)$-smooth function. Further, suppose that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f,x)\| \le \eta$. Then, for $T \ge \log(2|\Phi_T|)/\gamma$, it holds that

\[ \mathfrak{R}_T(\ell, \Phi_T) \le 2 \sqrt{ \frac{\gamma \eta^2 \log(2|\Phi_T|)}{T} } . \]

The next result generalizes the above lemma to the case when the exponent of smoothness is different from 2. Because of a different proof strategy, there are two differences between the next lemma and the previous one. First, instead of assuming smoothness of some power of $G$, we assume that the space $(\mathcal{B}, \|\cdot\|)$ is $(\gamma, p)$-smooth. Second, we get extra $\log(T)$ factors that are probably an artifact of our analysis.

Lemma 9. Let $\Phi_T$ be a finite set of payoff transformations with $|\Phi_T| > 1$. Assume that $\mathbf{B}(z_1, \ldots, z_T) = G\big( \frac{1}{T} \sum_{t=1}^{T} z_t \big)$, where $G \ge 0$ is 1-Lipschitz with respect to a norm $\|\cdot\|$ and $G(0) = 0$. Suppose that $(\mathcal{B}, \|\cdot\|)$ is a $(\gamma, p)$-smooth space. Further, suppose that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f,x)\| \le \eta$. Then, for any $T \ge 3$, it holds that

\[ \mathfrak{R}_T(\ell, \Phi_T) \le \frac{4 c\, \gamma^{1/p} \log^{3/2} T}{T^{1 - 1/p}} \sqrt{ \eta^2 \log(2|\Phi_T|) } \]

for some absolute constant $c$.



Having a bound on the complexity of a finite set of payoff transformations, we seek to extend the results to infinite sets. A natural approach is to pass to a finite cover of the set at the expense of losing an amount proportional to the resolution of the cover. Before proceeding, however, we need to define an appropriate notion of a cover. The following definition can be seen as a generalization of the corresponding notion introduced in [25]. We remark that the object for which we would like to provide a cover is the set $\Phi_T$ of payoff transformations. Whenever payoff transformations are simply constant time-invariant departure mappings, the complexity of $\Phi_T$ is identical to that of $\mathcal{F}$, yielding the online cover of the class $\mathcal{F}$ (see Section 5.1.1 for more details). In general, however, the set of payoff transformations can be much more complex than (or not even comparable to) $\mathcal{F}$.

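Definition 7. A set $V$ of $\mathcal{H}$-valued trees of depth $T$ is an $\alpha$-cover (with respect to the $\ell_p$-norm) of $\Phi_T$ on an $(\mathcal{F} \times \mathcal{X})$-valued tree $(\mathbf{f}, \mathbf{x})$ of depth $T$ if

\[ \forall \phi \in \Phi_T,\ \forall \epsilon \in \{\pm 1\}^T\ \exists \mathbf{v} \in V \ \text{s.t.}\ \Big( \frac{1}{T} \sum_{t=1}^{T} \big\| \mathbf{v}_t(\epsilon) - \ell_{\phi_t}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) \big\|^p \Big)^{1/p} \le \alpha . \qquad (7) \]

The covering number of the set of payoff transformations on a given tree $(\mathbf{f}, \mathbf{x})$ is defined as

\[ \mathcal{N}_p(\alpha, \Phi_T, (\mathbf{f}, \mathbf{x})) = \min\{ |V| : V \text{ is an } \alpha\text{-cover w.r.t. the } \ell_p\text{-norm of } \Phi_T \text{ on the } (\mathbf{f}, \mathbf{x}) \text{ tree} \} . \]

Further define $\mathcal{N}_p(\alpha, \Phi_T, T) = \sup_{(\mathbf{f}, \mathbf{x})} \mathcal{N}_p(\alpha, \Phi_T, (\mathbf{f}, \mathbf{x}))$, the maximal $\ell_p$ covering number of $\Phi_T$ over depth $T$ trees.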

This definition of the cover is indeed the most general for the setting we consider in this paper. In the sections that follow, we specialize this definition to fit particular assumptions on $\Phi_T$.

We now give a generalization of Dudley's bound for the case when $\mathbf{B}$ is a function of the average.

Theorem 10. Assume that $\mathbf{B}(z_1, \ldots, z_T) = G\big( \frac{1}{T} \sum_{t=1}^{T} z_t \big)$, where $G \ge 0$ is sub-additive, 1-Lipschitz with respect to a norm $\|\cdot\|$, $G(0) = 0$ and $G^2$ is $(\gamma, 2)$-smooth. Further, suppose that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f,x)\| \le 1$. Then it holds that

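\[ \mathfrak{R}_T(\ell, \Phi_T) \le 4 \inf_{\alpha > 0} \Big\{ \alpha + 6 \sqrt{\gamma} \int_{\alpha}^{1} \sqrt{ \frac{\log \mathcal{N}_\infty(\beta, \Phi_T, T)}{T} }\, d\beta \Big\} . \]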
3.4 General Bounds Under Linearity Assumptions on B

The general results of the previous section can be restated in simpler terms once more assumptions are made. In particular, some of the terms in the three-term decomposition in Theorem 1 can be dropped as soon as $\mathbf{B}$ is linear. While some of the results below can be repeated for a more general form $\mathbf{B}(z_1, \ldots, z_T) = \sum_{t=1}^{T} \langle c_t, z_t \rangle$ (for some $c_1, \ldots, c_T \in \mathcal{B}^*$ and $\mathcal{H} \subseteq \mathcal{B}$), for simplicity we assume that $\mathbf{B}$ is an average of its arguments and that $\mathcal{H} \subseteq \mathbb{R}$:

\[ \mathbf{B}(z_1, \ldots, z_T) = \frac{1}{T} \sum_{t=1}^{T} z_t . \]

Of course, such $\mathbf{B}$ is trivially smooth (with $\sigma = 0$), so all the results of the previous section apply.
Corollary 11. The following statements hold:

• The first term in the Triplex Inequality is zero.

• If $\Phi_T$ is a class of departure mappings, then the second term in the Triplex Inequality is non-positive. In this case,

\[ \mathcal{V}_T(\ell, \Phi_T) \le 2\, \mathfrak{R}_T(\ell, \Phi_T) . \]

• Let $\mathcal{H} \subseteq [-1, 1]$. We have,



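\[ \mathfrak{R}_T(\ell, \Phi_T) \le 4 \inf_{\alpha > 0} \Big\{ \alpha + 6 \sqrt{2} \int_{\alpha}^{1} \sqrt{ \frac{\log \mathcal{N}_\infty(\delta, \Phi_T, T)}{T} }\, d\delta \Big\} . \]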

Note that the use of $\ell_\infty$ covering numbers in the above result is not essential. In the case $\mathcal{H} \subseteq [-1, 1]$, we can use $\ell_2$ covering numbers by adapting the proof of Theorem 9 in [25].

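When $\mathbf{B}$ is the average of its coordinates, the sequential complexity takes on a familiar form:

\[ \mathfrak{R}_T(\ell, \Phi_T) = \sup_{(\mathbf{f}, \mathbf{x})} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi_T} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t\, \ell_{\phi_t}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) . \]

Further, for $\mathcal{H} \subseteq \mathbb{R}$, Eq. (7) in the definition of the cover becomes

\[ \forall \phi \in \Phi_T,\ \forall \epsilon \in \{\pm 1\}^T\ \exists \mathbf{v} \in V \ \text{s.t.}\ \Big( \frac{1}{T} \sum_{t=1}^{T} \big| \mathbf{v}_t(\epsilon) - \ell_{\phi_t}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) \big|^p \Big)^{1/p} \le \alpha , \]

where $V$ is now a set of $\mathbb{R}$-valued trees.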

A further simplification of various notions is obtained for time-invariant payoff transformations. Moreover, for time-invariant payoff transformations we can define combinatorial parameters generalizing Littlestone's [2"Tll3"] and fat-shattering [25] dimensions. This is the subject of the next section.



3.4.1 Combinatorial Parameters for Time-Invariant Payoff Transformations 

Assume $\mathcal{H} \subseteq \mathbb{R}$. Consider time-invariant payoff transformations generated from some base class of payoff transformations $\Phi$ (see Definition 2). That is, $\Phi_T = \{ (\phi, \ldots, \phi) : \phi \in \Phi \}$. We have the following definition of a generalized shattering dimension.

Definition 8. Let $\mathcal{H} = \{\pm 1\}$. An $(\mathcal{F} \times \mathcal{X})$-valued tree $(\mathbf{f}, \mathbf{x})$ of depth $d$ is shattered² by a payoff transformation class $\Phi$ if for all $\epsilon \in \{\pm 1\}^d$, there exists $\phi \in \Phi$ such that $\ell_\phi(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) = \epsilon_t$ for all $t \in [d]$. The shattering dimension $\mathrm{Sdim}(\Phi)$ is the largest $d$ such that $\Phi$ shatters an $(\mathcal{F} \times \mathcal{X})$-valued tree of depth $d$.

2 As a historical aside, the term "shattered set" was introduced by J. Michael Steele in his Ph.D. thesis in 1975. 
We can also define the scale-sensitive version of the shattering dimension, generalizing the fat-shattering dimension of [25].

Definition 9. An $(\mathcal{F} \times \mathcal{X})$-valued tree $(\mathbf{f}, \mathbf{x})$ of depth $d$ is $\alpha$-shattered by a payoff transformation class $\Phi$ if there exists an $\mathbb{R}$-valued tree $\mathbf{s}$ of depth $d$ such that

\[ \forall \epsilon \in \{\pm 1\}^d,\ \exists \phi \in \Phi \ \text{s.t.}\ \forall t \in [d],\quad \epsilon_t \big( \ell_\phi(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) - \mathbf{s}_t(\epsilon) \big) \ge \alpha / 2 . \]

The tree $\mathbf{s}$ is called the witness to shattering. The fat-shattering dimension $\mathrm{fat}_\alpha(\Phi)$ at scale $\alpha$ is the largest $d$ such that $\Phi$ $\alpha$-shatters an $(\mathcal{F} \times \mathcal{X})$-valued tree of depth $d$.

Slightly abusing notation, we write $\mathcal{N}_p(\alpha, \Phi, (\mathbf{f}, \mathbf{x}))$ instead of $\mathcal{N}_p(\alpha, \Phi_T, (\mathbf{f}, \mathbf{x}))$ whenever $\Phi_T$ consists of sequences of time-invariant payoff transformations with a base class $\Phi$.

The combinatorial parameters are useful if they can be shown to control problem complexity through, for instance, covering numbers. We state the following three results without proofs, as the arguments are identical to the ones given in [25]. To be precise, the $(\mathbf{f}, \mathbf{x})$ tree here plays the role of the $\mathbf{x}$ tree in [25], and $\ell_\phi$ for $\phi \in \Phi$ plays the role of $f \in \mathcal{F}$ in [25].

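Theorem 12. Let $\mathcal{H} \subseteq \{0, \ldots, k\}$ and $\mathrm{fat}_2(\Phi) = d$. Then

\[ \mathcal{N}_\infty(1/2, \Phi, T) \le \sum_{i=0}^{d} \binom{T}{i} k^i \le (ekT)^d . \]

Furthermore, for $T \ge d$,

\[ \sum_{i=0}^{d} \binom{T}{i} k^i \le \Big( \frac{ekT}{d} \Big)^d . \]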
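The two combinatorial inequalities in Theorem 12 are easy to evaluate numerically, which helps in judging how loose each bound is for moderate $T$. The sketch below is illustrative; the function names `sauer_sum` and `sauer_bounds` are hypothetical, and the second bound is only evaluated for $d \ge 1$.

```python
from math import comb, e


def sauer_sum(horizon, k, d):
    """Sum_{i=0}^{d} C(horizon, i) * k^i, the combinatorial quantity in Theorem 12."""
    return sum(comb(horizon, i) * k ** i for i in range(d + 1))


def sauer_bounds(horizon, k, d):
    """Return (sum, (e k T)^d, (e k T / d)^d) for d >= 1."""
    total = sauer_sum(horizon, k, d)
    return total, (e * k * horizon) ** d, (e * k * horizon / d) ** d


if __name__ == "__main__":
    for horizon in (5, 10, 20):
        print(horizon, sauer_bounds(horizon, 2, 3))
```

The sharper bound $(ekT/d)^d$ is stated in the theorem only for $T \ge d$, so a comparison against it should be restricted to that range.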



We now show that the covering numbers are bounded in terms of the fat-shattering dimension. 

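Corollary 13. Suppose $\mathcal{H} \subseteq [-1, 1]$. Then for any $\alpha > 0$, any $T > 0$, and any $(\mathcal{F} \times \mathcal{X})$-valued tree $(\mathbf{f}, \mathbf{x})$ of depth $T$,

\[ \mathcal{N}_1(\alpha, \Phi, (\mathbf{f}, \mathbf{x})) \le \mathcal{N}_2(\alpha, \Phi, (\mathbf{f}, \mathbf{x})) \le \mathcal{N}_\infty(\alpha, \Phi, (\mathbf{f}, \mathbf{x})) \le \Big( \frac{2eT}{\alpha} \Big)^{\mathrm{fat}_\alpha(\Phi)} . \]
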
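Theorem 14. Let $\mathcal{H} \subseteq \{0, \ldots, k\}$ and $\mathrm{fat}_1(\Phi) = d$. Then

\[ \mathcal{N}(0, \Phi, T) \le \sum_{i=0}^{d} \binom{T}{i} k^i \le (ekT)^d . \]

Furthermore, for $T \ge d$,

\[ \sum_{i=0}^{d} \binom{T}{i} k^i \le \Big( \frac{ekT}{d} \Big)^d . \]

In particular, the result holds for binary-valued function classes ($k = 1$), in which case $\mathrm{fat}_1(\Phi) = \mathrm{Sdim}(\Phi)$.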

The generality of these results is evident, as both the combinatorial parameters and covering numbers are defined for any performance measure (1) with time-invariant payoff transformations. In particular, this includes $\Phi$-regret (see Section 5.1).
3.5 General Bounds for Slowly-Varying Payoff Transformations



In Section 3.4.1 we assumed that the set $\Phi_T$ of sequences of payoff transformations is time-invariant. This assumption naturally leads to a control on the complexity of $\Phi_T$. Lifting the assumption of time-invariance, we now go back to the level of generality of Proposition 6. We observe that the size of $\Phi_T$, or an appropriately behaving covering number $\mathcal{N}(\alpha, \Phi_T, T)$, is key for bounding the sequential complexity. If payoff transformations change wildly in time, there is little hope of getting non-trivial bounds. The good news is that, under some assumptions on the variability of the sequences in $\Phi_T$, we can get a bound on the covering number of $\Phi_T$.

It has been shown in [18, 6] that it is possible to have small external regret against comparators that change a limited number of times. This alleviates an obvious limitation of the classical notion of external regret, viz., comparison to the fixed best decision. Another result of this flavor appears in [27], where dynamic regret is defined with respect to a comparator whose path length is bounded. In general, one can consider situations where we would like to compete with a budgeted comparator. We now show that the assumptions of slowly-varying or budgeted comparators are naturally captured by our framework through the notion of slowly-changing payoff transformations $\Phi_T$. Furthermore, the control of covering numbers of $\Phi_T$ becomes transparent under such assumptions. Our goal here is not to provide a comprehensive list of possible results, but rather to show the versatility of our framework.

3.5.1 Tracking the Best Transformation 

Suppose $\Phi$ is a finite set of payoff transformations. Let $\Phi_T^k$ be obtained by considering all piecewise constant sequences with $k$ changes:

\[ \Phi_T^k = \{ (\phi_1, \ldots, \phi_T) : 1 = i_0 \le i_1 \le \cdots \le i_k \le T \ \text{and}\ \phi_t = \phi_{t'} \ \text{if}\ i_s \le t < t' < i_{s+1} \ \text{for some}\ s \ge 0 \} . \]

If the cardinality $|\Phi| = N$, it is easy to check that $|\Phi_T^k| \le \binom{T}{k} \cdot N^{k+1}$. Under the assumptions of Proposition 6 this immediately implies a bound of the order

\[ \big( R^2 (k \log N + k \log T)\, T \big)^{1/2q} + \sigma^{1/q}\, T^{1/q} . \]
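The cardinality bound for piecewise constant sequences can be verified by brute force on small instances. The sketch below is illustrative and makes one simplifying assumption: a sequence is identified with the list of base-class indices it uses, and "at most $k$ changes" is read as at most $k$ positions where consecutive entries differ. The function names are hypothetical.

```python
import itertools
from math import comb


def count_piecewise_constant(num_maps, horizon, max_changes):
    """Count length-`horizon` sequences over `num_maps` symbols that switch
    value at most `max_changes` times (brute-force enumeration)."""
    count = 0
    for seq in itertools.product(range(num_maps), repeat=horizon):
        changes = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
        if changes <= max_changes:
            count += 1
    return count


def cardinality_bound(num_maps, horizon, max_changes):
    """The bound C(T, k) * N^(k+1) on the number of such sequences."""
    return comb(horizon, max_changes) * num_maps ** (max_changes + 1)


if __name__ == "__main__":
    print(count_piecewise_constant(3, 5, 2), cardinality_bound(3, 5, 2))
```

Taking logarithms of the bound gives $k \log T + (k+1) \log N$ up to constants, which is where the $k \log N + k \log T$ factor in the displayed rate comes from.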

It is natural to extend the above results by lifting the assumption that $\Phi$ is a finite set of payoff transformations. This can be done by considering an online cover $\mathcal{N}_p(\ell, \Phi, \alpha)$ of $\Phi$ in some $\ell_p$ norm along with the same definition of $\Phi_T^k$. Next we do this in an even more general setting.

3.5.2 Slowly Changing Transformations 

To start, suppose $\Phi_T$ consists of payoff transformations $(\phi_1, \ldots, \phi_T)$ which are "almost" time-invariant within each of $k + 1$ intervals. Consider the following definition:

\[ \Phi_T^{k,\alpha} = \Big\{ (\phi_1, \ldots, \phi_T) : 1 = i_0 \le i_1 \le \cdots \le i_k \le T \ \text{and}\ \sup_{f,x} \| \ell_{\phi_t}(f,x) - \ell_{\phi_{t'}}(f,x) \| \le \alpha \ \text{if}\ i_s \le t < t' < i_{s+1} \ \text{for some}\ s \ge 0 \Big\} . \]

One can think of the time-invariant segments as "accumulation points" where the payoff transformations do not vary much.

Suppose that we have a finite cover $V$ of $\Phi$ at scale $\alpha$, of cardinality $|V| = \mathcal{N}_\infty(\alpha, \Phi, T)$. The $\ell_\infty$ covering is chosen for the purposes of simplicity, though tighter (and more difficult) results are expected from directly studying $\ell_2$ covering numbers.
Lemma 15. If $\mathcal{N}_\infty(\alpha, \Phi, T)$ is finite,

\[ \mathcal{N}_\infty(2\alpha, \Phi_T^{k,\alpha}, T) \le \binom{T}{k} \cdot \mathcal{N}_\infty(\alpha, \Phi, T)^{k+1} . \]

Further extending the above results, we will now study the size of an online cover if $\Phi_T$ consists of payoff transformations of bounded length. In general, "length" can be defined as some budget given by the setting at hand. Here, we present a straightforward approach without an attempt to give very general and tight bounds.

Suppose that $\Phi_T$ is a set of sequences $(\phi_1, \ldots, \phi_T)$ of payoff transformations which do not "vary much", according to the following definition. The length of a sequence $(\phi_1, \ldots, \phi_T)$ of payoff transformations (with respect to the supremum distance between payoffs) is defined as

\[ \mathrm{len}(\phi_1, \ldots, \phi_T) := \sum_{t=1}^{T-1} \sup_{f,x} \| \ell_{\phi_t}(f,x) - \ell_{\phi_{t+1}}(f,x) \| . \]

Again, we consider the distance between payoffs (as functions over $\mathcal{F} \times \mathcal{X}$). Assume that for all sequences in $\Phi_T$ their length is bounded by some $L > 0$. We now claim that by choosing $k$ large enough, the set of covering trees $V^k$ defined in the proof of Lemma 15 provides a cover for $\Phi_T$ at a given scale $\alpha > 0$. Consider any $(\phi_1, \ldots, \phi_T) \in \Phi_T$. We construct the nondecreasing sequence $i_1, \ldots, i_j, \ldots \in \{1, \ldots, T\}$ of "change-points" as follows: increase $t$ until the next payoff transformation is farther than $\alpha$ from the payoff transformation at $i_j$:

\[ i_{j+1} = \inf\Big\{ t > i_j : \sup_{f,x} \| \ell_{\phi_t}(f,x) - \ell_{\phi_{i_j}}(f,x) \| > \alpha \Big\} . \]

Let $k$ be the length of the largest such sequence for all elements of $\Phi_T$. We have simply reduced the problem to the one studied in the previous section: within each block, all the payoff transformations are close.
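The change-point construction described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the construction used in the proof: each payoff transformation is represented by the finite vector of its payoffs on a fixed grid of $(f, x)$ pairs, so that the supremum distance becomes a maximum over coordinates, and indices start at 0. The names `sup_distance`, `change_points` and `path_length` are hypothetical.

```python
def sup_distance(u, v):
    """Supremum distance between two payoff vectors on a common finite grid."""
    return max(abs(a - b) for a, b in zip(u, v))


def change_points(payoffs, alpha):
    """Greedy change points: open a new block at the first time t whose payoff
    vector is farther than alpha from the payoff at the start of the current block."""
    points = [0]
    for t in range(1, len(payoffs)):
        if sup_distance(payoffs[t], payoffs[points[-1]]) > alpha:
            points.append(t)
    return points


def path_length(payoffs):
    """Length of the sequence: sum of distances between consecutive payoffs."""
    return sum(sup_distance(a, b) for a, b in zip(payoffs, payoffs[1:]))


if __name__ == "__main__":
    payoffs = [(0.0,), (0.05,), (0.1,), (0.25,), (0.3,), (0.5,)]
    print(change_points(payoffs, 0.2), path_length(payoffs))
```

Each new block consumes more than $\alpha$ of the total length by the triangle inequality, so the number of blocks opened after the first is at most the path length divided by $\alpha$.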

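Clearly, $k = k(\alpha) \le L/\alpha$, but it can potentially be smaller under additional assumptions on $\Phi_T$. We then have a bound on the size of a $2\alpha$-cover of $\Phi_T$:

\[ \mathcal{N}_\infty(2\alpha, \Phi_T, T) \le \binom{T}{k(\alpha)} \cdot \mathcal{N}_\infty(\alpha, \Phi, T)^{k(\alpha)+1} \]

and

\[ \log \mathcal{N}_\infty(2\alpha, \Phi_T, T) \le O\Big( \frac{L}{\alpha} \log T + \frac{L}{\alpha} \log \mathcal{N}_\infty(\alpha, \Phi, T) \Big) . \]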



The covering number can now be used, for example in Theorem 10, to control sequential complexity when $\mathbf{B}$ is a function of the average. We note that it is possible to derive an analogous Dudley integral type bound solely under smoothness assumptions on $\mathbf{B}$.



4 Techniques for Lower Bounds 

It is well-known that an equalizing strategy (i.e. a strategy that makes the move of the other player "irrel- 
evant") can often be shown to be minimax optimal. In this section, we define a notion of an equalizer for 
our repeated game and show that it can be used to prove lower bounds on the value of the game. While 
existence of an equalizer has to be established for particular problems at hand, the lower bounds below hold 
whenever such an equalizer exists. 
Definition 10. A strategy $\{p_t^*\}$ for the adversary is said to be an equalizer strategy if

\[ \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1^*} \cdots \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T^*}\, \mathbf{R}_T((f_1, x_1), \ldots, (f_T, x_T)) = \mathbb{E}_{f_1 \sim q'_1,\, x_1 \sim p_1^*} \cdots \mathbb{E}_{f_T \sim q'_T,\, x_T \sim p_T^*}\, \mathbf{R}_T((f_1, x_1), \ldots, (f_T, x_T)) \]

for all strategies $\{q_t\}$ and $\{q'_t\}$ of the player. Here $\mathbf{R}_T$ is defined as in (1).

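Using the above definition of an equalizer we have the following proposition as an immediate consequence.

Proposition 16. For any equalizer strategy $\{p_t^*\}$ we have that for any $f \in \mathcal{F}$,

\[ \mathcal{V}_T(\ell, \Phi_T) \ge \mathbb{E}_{x_1 \sim p_1} \cdots \mathbb{E}_{x_T \sim p_T}\, \mathbf{R}_T((f, x_1), \ldots, (f, x_T)) \]

where $p_t = p_t^*(\{f_s = f, x_s\}_{s=1}^{t-1})$.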



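Remark 1. For many interesting games we consider, it is often the case that for any $x_1, \ldots, x_T$ and any $f_1, \ldots, f_T, f'_1, \ldots, f'_T$,

\[ \inf_{\phi \in \Phi_T} \mathbf{B}\big( \ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T) \big) = \inf_{\phi \in \Phi_T} \mathbf{B}\big( \ell_{\phi_1}(f'_1, x_1), \ldots, \ell_{\phi_T}(f'_T, x_T) \big) . \]

In these cases, since the player's actions do not even affect the second term of the regret, to check whether a strategy $\{p_t^*\}$ is an equalizer we only need to check if

\[ \mathbb{E}_{x_1 \sim p_1^*,\, f_1 \sim q_1} \cdots \mathbb{E}_{x_T \sim p_T^*,\, f_T \sim q_T}\, \mathbf{B}\big( \ell(f_1, x_1), \ldots, \ell(f_T, x_T) \big) = \mathbb{E}_{x_1 \sim p_1^*,\, f_1 \sim q'_1} \cdots \mathbb{E}_{x_T \sim p_T^*,\, f_T \sim q'_T}\, \mathbf{B}\big( \ell(f_1, x_1), \ldots, \ell(f_T, x_T) \big) \]

for all strategies $\{q_t\}$ and $\{q'_t\}$ of the player.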

Interestingly enough, many of the existing lower bounds in the online learning literature are, in fact, equalizers (see e.g. [H p. 252]). In particular, in [1], a lower bound on the value of the game was derived by looking at a certain face of a convex hull of loss vectors. The face, supported by a probability distribution $p$, corresponds to the set of functions with the same expected loss under the distribution $p$. Hence, $p$ is an equalizing strategy for those functions. Since these functions are the "best" with respect to this distribution, a lower bound in terms of complexity of this set was derived in [1]. Furthermore, [19] shows that a lower bound on the rate of convergence in the i.i.d. setting is achieved when there are two distinct minimizers of expected error for a given distribution. Again, this distribution can be viewed as an equalizer for the non-singleton set of minimizers of expected error.



5 Examples and Comparison to Known Results 

We now turn to several specific settings studied in the literature and look at them through the prism of 
our general results. While we believe that online learnability in many different scenarios can be established 
through our framework, we decided to focus on several major problems. On the surface, these problems are 
quite different; yet, through our unified approach we show that learnability can be seamlessly established for 
all of them. The unification not only leads to simpler proofs and sharper results, but also yields insight into 
the inherent complexity and ways of making more comprehensive statements. 
5.1 $\Phi$-Regret

In this section, we consider a particular notion of performance measure, known as $\Phi$-regret |26l 1151 ITS] . In our framework, this means that we restrict ourselves to only time-invariant departure mapping classes $\Phi_T$ specified by a base class $\Phi$ of mappings from $\mathcal{F}$ to itself (see Definitions 1 and 2). The particular choices of $\Phi$ lead to various notions, such as external, internal, swap regret, and more.

To define $\Phi$-regret (Example 5), we fix a set $\Phi$ of departure mappings which map $\mathcal{F}$ to $\mathcal{F}$ and define the set of time-invariant departure mappings $\Phi_T := \{ (\phi, \ldots, \phi) : \phi \in \Phi \}$. Then the measure of performance becomes $\Phi$-regret:

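\[ \mathbf{R}_T = \frac{1}{T} \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^{T} \ell(\phi(f_t), x_t) \]

where $\mathcal{H} \subseteq \mathbb{R}$. Since $\mathbf{B}$ is the average of its arguments, Corollary 11 implies

Corollary 17. In the setting of $\Phi$-regret,

\[ \mathcal{V}_T(\ell, \Phi_T) \le 2\, \mathfrak{R}_T(\ell, \Phi_T) . \]
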
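Specializing the definition of sequential complexity to $\Phi$-regret, we obtain the following definition.

Definition 11. The sequential complexity for $\Phi$-regret is defined as

\[ \mathfrak{R}_T(\ell, \Phi) = \sup_{(\mathbf{f}, \mathbf{x})} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t\, \ell(\phi \circ \mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) \qquad (8) \]

where, as before, the first supremum is over $(\mathcal{F} \times \mathcal{X})$-valued trees $(\mathbf{f}, \mathbf{x})$ of depth $T$.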

The following property allows us to immediately obtain bounds for convex hulls of finite sets.

Proposition 18. Suppose $\ell$ is convex in the first argument and $\mathrm{conv}(\Phi)$ maps $\mathcal{F}$ into $\mathcal{F}$. Then

\[ \mathfrak{R}_T(\ell, \mathrm{conv}(\Phi)) = \mathfrak{R}_T(\ell, \Phi) . \]

We also have the following version of the contraction lemma, whose proof is identical to that given in [25].

Lemma 19. Fix a function $\psi : \mathbb{R} \times \mathcal{F} \times \mathcal{X} \to \mathbb{R}$ such that for any $f \in \mathcal{F}$, $x \in \mathcal{X}$, $\psi(\cdot, f, x)$ is a Lipschitz function with a constant $L$. Then

\[ \mathfrak{R}_T(\psi \circ \ell, \Phi) \le L \cdot \mathfrak{R}_T(\ell, \Phi) \]

where $\psi \circ \ell$ is defined by the mapping $(f, x) \mapsto \psi(\ell(f, x), f, x)$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$.

Next, we specialize Definition 7 to the particular case of $\Phi$-regret.

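Definition 12. A set $V$ of $\mathbb{R}$-valued trees of depth $T$ is an $\alpha$-cover (with respect to the $\ell_p$-norm) of $\Phi_T$ on the $(\mathcal{F} \times \mathcal{X})$-valued tree $(\mathbf{f}, \mathbf{x})$ of depth $T$ if

\[ \forall \phi \in \Phi,\ \forall \epsilon \in \{\pm 1\}^T\ \exists \mathbf{v} \in V \ \text{s.t.}\ \Big( \frac{1}{T} \sum_{t=1}^{T} \big| \mathbf{v}_t(\epsilon) - \ell(\phi \circ \mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) \big|^p \Big)^{1/p} \le \alpha . \]

The covering number of $\Phi$ on a given tree $(\mathbf{f}, \mathbf{x})$ is defined as the size of the minimum cover, as in Definition 7.

We now turn to particular examples to utilize the results and definitions stated above.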
5.1.1 External Regret

External regret is the simplest example of $\Phi$-regret. We separate it from the general discussion in order to show that for external regret the various notions introduced in this paper reduce to the ones proposed in [25].

Considering the definitions in Example 1, notice that the time-invariant departure mappings class $\Phi_T$ is chosen to be the class of sequences of constant mappings $\{ (\phi_f, \ldots, \phi_f) : f \in \mathcal{F} \ \text{and}\ \phi_f(g) = f \ \forall g \in \mathcal{F} \}$. It is precisely because of this constancy of $\phi$ that the dependence on the $\mathcal{F}$-valued tree $\mathbf{f}$ disappears from all the definitions and results. Further, because of the obvious bijection between elements of $\Phi_T$ and $\mathcal{F}$, minimization (maximization) over $\Phi_T$ can be written as minimization (maximization) over $\mathcal{F}$. Notice that the action of $\phi_f$ on the payoff is $\ell_{\phi_f}(f_t, x_t) = \ell(f, x_t)$.



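Let us turn to Definition 11 of the sequential complexity for $\Phi$-regret. Because each $\phi_f \in \Phi$ is a constant mapping, we have

\[ \mathfrak{R}_T(\ell, \Phi) = \sup_{(\mathbf{f}, \mathbf{x})} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi_f \in \Phi} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t\, \ell(f, \mathbf{x}_t(\epsilon)) = \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t\, \ell(f, \mathbf{x}_t(\epsilon)) . \qquad (9) \]

If the payoff is written as $\ell(f, x) = f(x)$, this is precisely the sequential Rademacher complexity defined in [25].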



Next, we show that Definition 12 reduces to the definition of online covering given in [25]. Indeed, $\ell_{\phi_f}(\mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon)) = \ell(f, \mathbf{x}_t(\epsilon))$ for the constant mappings $\phi = (\phi_f, \ldots, \phi_f)$. Further, the payoff space $\mathcal{H} \subseteq \mathbb{R}$. With these simplifications, the closeness to a covering element in Definition 12 becomes

\[ \forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^T\ \exists \mathbf{v} \in V \ \text{s.t.}\ \Big( \frac{1}{T} \sum_{t=1}^{T} \big| \mathbf{v}_t(\epsilon) - \ell(f, \mathbf{x}_t(\epsilon)) \big|^p \Big)^{1/p} \le \alpha \]

where $V$ is a set of $\mathbb{R}$-valued trees. It is then immediate that Corollary 11 recovers the corresponding result of [25]. For a detailed study of external regret, we refer the reader to the companion paper [25].



Lower Bounds in the Supervised Setting. We provide a lower bound for external regret in the supervised learning setting using the notion of an equalizer (see Section 4). To this end, we assume that $\mathcal{X} = \mathcal{Z} \times \mathcal{Y}$, where $\mathcal{Z}$ is the space of predictors and $\mathcal{Y}$ is the space of responses (outcomes). The setting is called supervised because, in machine learning terminology, the observed data is thought of as examples together with labels. Assume $\mathcal{F}$ is a class of bounded real-valued functions and the space of outcomes is a bounded interval; for simplicity let $\mathcal{F} \subseteq [-1, 1]^{\mathcal{Z}}$ and $\mathcal{Y} = [-1, 1]$. Suppose the loss is of the form $\ell(f, (z, y)) = |f(z) - y|$.

Proposition 20. The value of the supervised game defined above is lower bounded by sequential Rademacher complexity:

\[ \mathcal{V}_T(\ell, \Phi_T) \ge \mathfrak{R}_T(\mathcal{F}) . \]

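Proof. Recall that we have a fixed set $\Phi$ of constant departure mappings. We will now exhibit an equalizer strategy. Following Remark 1, observe that for any $(z_1, y_1), \ldots, (z_T, y_T)$ and any $f_1, \ldots, f_T, f'_1, \ldots, f'_T$,

\[ \inf_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^{T} \ell(\phi(f_t), (z_t, y_t)) = \inf_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^{T} \ell(\phi(f'_t), (z_t, y_t)) \]

because any $\phi \in \Phi$ is a constant mapping. Thus, for a strategy to be an equalizer, it only needs to "equalize" the cumulative loss of the player. Here is how we construct such a strategy. Let $p^Y$ be defined as the distribution of a Rademacher $\pm 1$ random variable $Y$; this will define the labels $y_t$ as independent coin flips. Now, fix any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $T$. Let $\{p_t^*\}$ be a strategy defined by $p_t^*(y_{1:t-1}) = \delta_{\mathbf{z}_t(y_{1:t-1})} \times p^Y$, a delta distribution on $\mathbf{z}_t(y_{1:t-1})$ defined by the tree $\mathbf{z}$ and $p^Y$ on $y$. In plain words, the strategy of the adversary for each $t$ is to choose a particular $z_t \in \mathcal{Z}$ given the labels $y_1, \ldots, y_{t-1}$, and let the label be an independent Rademacher random variable.

By Remark 1 it is enough to check

\[ \mathbb{E}_{(z_1,y_1) \sim p_1^*,\, f_1 \sim q_1} \cdots \mathbb{E}_{(z_T,y_T) \sim p_T^*,\, f_T \sim q_T}\, \frac{1}{T} \sum_{t=1}^{T} |f_t(z_t) - y_t| = \mathbb{E}_{(z_1,y_1) \sim p_1^*,\, f_1 \sim q'_1} \cdots \mathbb{E}_{(z_T,y_T) \sim p_T^*,\, f_T \sim q'_T}\, \frac{1}{T} \sum_{t=1}^{T} |f_t(z_t) - y_t| \]

for all strategies $\{q_t\}$ and $\{q'_t\}$ of the player. This equality is indeed true because $\mathbb{E}_{y_t \sim p^Y} |a - y_t| = 1$ independently of the constant $a \in [-1, 1]$. By Proposition 16, for any $g \in \mathcal{F}$,

\[ \mathcal{V}_T(\ell, \Phi_T) \ge \mathbb{E}_{(z_1,y_1) \sim p_1} \cdots \mathbb{E}_{(z_T,y_T) \sim p_T} \Big[ \frac{1}{T} \sum_{t=1}^{T} |g(z_t) - y_t| - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} |f(z_t) - y_t| \Big] \]
\[ = \mathbb{E}_{y_1, \ldots, y_T} \Big[ 1 - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} |f(\mathbf{z}_t(y_{1:t-1})) - y_t| \Big] = \mathbb{E}_{y_1, \ldots, y_T} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} y_t f(\mathbf{z}_t(y_{1:t-1})) \Big] \]

where $y_1, \ldots, y_T$ are i.i.d. Rademacher random variables. Since the lower bound holds for any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $T$, it also holds for the supremum:

\[ \mathcal{V}_T(\ell, \Phi_T) \ge \sup_{\mathbf{z}} \mathbb{E}_{y_1, \ldots, y_T} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} y_t f(\mathbf{z}_t(y_{1:t-1})) \Big] = \mathfrak{R}_T(\mathcal{F}) . \]

Hence, the lower bound on the value of the supervised game is the sequential Rademacher complexity of $\mathcal{F}$. □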
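Two elementary identities carry the proof above, and it may help to see them written out. The following is a worked verification under the assumptions of the proposition ($a \in [-1, 1]$, and $y$ a Rademacher variable taking values $\pm 1$ with equal probability).

```latex
% For y \in \{\pm 1\} and a \in [-1, 1], the absolute loss linearizes:
|a - y| = 1 - y a
\quad\text{since } |a - 1| = 1 - a \text{ and } |a + 1| = 1 + a .

% Taking expectation over a Rademacher y gives the equalizing property:
\mathbb{E}_y\, |a - y| = 1 - a\, \mathbb{E}_y[y] = 1 .

% Applying the first identity inside the infimum turns it into a supremum of a correlation:
1 - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} |f(z_t) - y_t|
  = \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} y_t f(z_t) .
```

The first identity is the only place where the restriction of function values to $[-1, 1]$ is used.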



Lower Bounds for Online Convex Optimization. We first provide a lower bound for a linear game. By Lemma 42 this lower bound will also serve as a lower bound for a convex Lipschitz game. We remark that these lower bounds are not entirely new (see e.g. [HE]), and we derive them here for the purposes of completeness, as well as to stress that they arise from an equalizing strategy.

Suppose $\mathcal{F}$ is a unit ball in some norm $\|\cdot\|$ and $\mathcal{X}$ is a unit ball in the dual norm $\|\cdot\|_*$. The loss is $\ell(f, x) = x(f) = \langle f, x \rangle$ and the set $\Phi$ is, again, a set of constant departure mappings.

Proposition 21. The value of the linear game defined above is lower bounded by sequential Rademacher complexity:

\[ \mathcal{V}_T(\ell, \Phi_T) \ge \mathfrak{R}_T(\ell, \Phi) . \]

Hence, the value of the convex Lipschitz game (where $\mathcal{X}$ is the set of all 1-Lipschitz convex functions on $\mathcal{F}$) is also lower bounded by the same quantity.

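Proof. Similarly to the proof for the supervised game, observe that for any $x_1, \ldots, x_T$ and any $f_1, \ldots, f_T, f'_1, \ldots, f'_T$,

\[ \inf_{\phi \in \Phi} \sum_{t=1}^{T} \ell(\phi(f_t), x_t) = \inf_{\phi \in \Phi} \sum_{t=1}^{T} \ell(\phi(f'_t), x_t) \]

because any $\phi \in \Phi$ is a constant mapping. Following Remark 1, we only need to exhibit a strategy that equalizes the player's loss. To this end, fix an $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $T$. Consider the adversary's strategy where at each step an $\epsilon_t$ is chosen uniformly at random from $\{\pm 1\}$ and $x_t = \epsilon_t \cdot \mathbf{x}_t(\epsilon_{1:t-1}) \in \mathcal{X}$.

By Remark 1 it is enough to check

\[ \mathbb{E}_{\epsilon_1,\, f_1 \sim q_1} \cdots \mathbb{E}_{\epsilon_T,\, f_T \sim q_T}\, \frac{1}{T} \sum_{t=1}^{T} \epsilon_t \langle f_t, \mathbf{x}_t(\epsilon_{1:t-1}) \rangle = \mathbb{E}_{\epsilon_1,\, f_1 \sim q'_1} \cdots \mathbb{E}_{\epsilon_T,\, f_T \sim q'_T}\, \frac{1}{T} \sum_{t=1}^{T} \epsilon_t \langle f_t, \mathbf{x}_t(\epsilon_{1:t-1}) \rangle \]

for all strategies $\{q_t\}$ and $\{q'_t\}$ of the player. This equality is indeed true because both terms are identically zero. By Proposition 16, for any $g \in \mathcal{F}$,

\[ \mathcal{V}_T(\ell, \Phi_T) \ge \mathbb{E}_{\epsilon_1, \ldots, \epsilon_T} \Big[ \frac{1}{T} \sum_{t=1}^{T} \epsilon_t \langle g, \mathbf{x}_t(\epsilon_{1:t-1}) \rangle - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t \langle f, \mathbf{x}_t(\epsilon_{1:t-1}) \rangle \Big] = \mathbb{E}_{\epsilon_1, \ldots, \epsilon_T} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t \langle f, \mathbf{x}_t(\epsilon_{1:t-1}) \rangle \Big] , \]

where the last step uses that the first term has zero mean and that the unit ball $\mathcal{F}$ is symmetric. Since this holds for any $\mathcal{X}$-valued tree $\mathbf{x}$, we have proven the statement. □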



5.1.2 Internal and Swap Regret 

Assume the cardinality $N = |\mathcal{F}|$ is finite. For internal regret, $\Phi$ is the set of mappings $\{ \phi_{f \to g} : \phi_{f \to g}(f) = g \ \text{and}\ \phi_{f \to g}(h) = h \ \forall h \ne f,\ h \in \mathcal{F} \}$. For swap regret [8], $\Phi$ contains all $N^N$ functions from $\mathcal{F}$ to itself. It is easy to see that the finite class lemma (Lemma 5) immediately recovers the $O(\sqrt{T \log N})$ bound for internal and external regret and the $O(\sqrt{T N \log N})$ bound for swap regret [5].
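The arithmetic behind these rates is short. The sketch below assumes, for illustration, that payoffs take values in $[-1, 1]$, so that the finite class lemma applies with $\sum_t \mathbf{v}_t(\epsilon)^2 \le T$, and it bounds the unnormalized sum $\sum_t \epsilon_t \mathbf{v}_t(\epsilon)$.

```latex
% Internal regret: one mapping per ordered pair f \ne g, so
|\Phi| = N(N-1) \le N^2, \qquad \log|\Phi| \le 2\log N,
\qquad \sqrt{2\,T\log|\Phi|} \le 2\sqrt{T\log N}.

% External regret: one constant mapping per f \in \mathcal{F}, so
|\Phi| = N, \qquad \sqrt{2\,T\log|\Phi|} = \sqrt{2\,T\log N}.

% Swap regret: all functions from \mathcal{F} to itself, so
|\Phi| = N^N, \qquad \log|\Phi| = N\log N,
\qquad \sqrt{2\,T\log|\Phi|} = \sqrt{2\,T N\log N}.
```

Dividing by $T$ gives the corresponding bounds on the averaged sequential complexity.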

Our general tools, however, allow us to go well beyond finite sets of departure mappings. In the following sections, we consider several examples of infinite classes of departure mappings which have been considered in the literature. In some of these cases, an explicit strategy requires computation of a fixed point [16, 15]. Since we are not providing efficient algorithms in order to obtain bounds, we are able to get sharp results by directly focusing on the complexity of these infinite classes of departure mappings.



5.1.3 Convergence to $\Phi$-correlated Equilibria

A beautiful result of Foster and Vohra [11] shows that convergence to the set of correlated equilibria can be achieved if players follow internal regret minimization strategies. What is surprising is that no coordination is required to achieve this goal. Stoltz and Lugosi extended this result to compact and convex sets of strategies in normed spaces. In this section we show that their results can be improved in certain situations.

Let us consider their setting in a bit more detail. Suppose there are $N$ players, each playing in a strategy set $\mathcal{F}$. We could make the strategy set player dependent, but it only complicates notation. There are $N$ loss functions mapping a strategy profile $(f_1, \ldots, f_N)$ to $\{ \ell_k(f_1, \ldots, f_N) \}_{k=1}^{N}$, the losses for each of the $N$ players. Consider a set of departure mappings $\Phi \subseteq \{ \phi : \mathcal{F} \to \mathcal{F} \}$. A $\Phi$-correlated equilibrium is a distribution $\pi$ over strategy profiles such that if the players jointly play according to it, no player has an incentive to unilaterally transform its action using a mapping from $\Phi$. That is,

\[ \forall k \in [N],\ \forall \phi \in \Phi, \quad \mathbb{E}_{(f_1, \ldots, f_N) \sim \pi} [ \ell_k(f_k, f_{-k}) ] \le \mathbb{E}_{(f_1, \ldots, f_N) \sim \pi} [ \ell_k(\phi(f_k), f_{-k}) ] . \]

Theorem 18 of Stoltz and Lugosi shows the following. If $\mathcal{F}$ is a convex compact subset of a normed vector space, the $\ell_k$'s are continuous and $\Phi$ is a separable subset of $C(\mathcal{F})$³, then there exist regret minimizing algorithms such that, if every player follows the algorithm, then the sequence of empirical plays jointly converges to the set of $\Phi$-correlated equilibria.

3 The set of continuous functions on $\mathcal{F}$ equipped with the supremum norm.

Consider a particular player $k$. The regret minimizing algorithm for it is simply a $\Phi$-regret minimizing
algorithm with $\ell(f,x) = x(f)$, where we have identified the adversary set $\mathcal{X}$ with the class of functions
$\{f \mapsto \ell_k(f, g) : g \in \mathcal{F}^{N-1}\}$, where $g$ is a strategy profile over the remaining $N-1$ players. Examining
Stoltz and Lugosi's proof reveals that they pass to a dense countable subset of $\Phi$ and use an explicit regret
minimizing algorithm for countably infinite classes of departure mappings. The regret with respect to each
$\phi$ in this countable subset does go to zero, but the rate is not uniform in $\phi$. In particular, it depends on the
order in which the subset is enumerated. Later, they also consider examples of uncountable classes $\Phi$ of departure
mappings where non-asymptotic rates of convergence for $\Phi$-regret can be obtained. Specifically, they use the metric
entropy of $\Phi$. We show how to improve their bounds using sequential complexity.

As an example, consider the case where $\mathcal{F}$ is some compact subset of the unit ball in some normed space
with a norm $\|\cdot\|$, the loss function $\ell_k$ is a 1-Lipschitz convex function, and the class $\Phi$ of departure functions
has finite metric entropy $\mathcal{N}_{\mathrm{metric}}(\Phi, \alpha)$ for all $\alpha > 0$. Metric entropy is simply the log covering number, where
covers of $\Phi$ are built for the supremum norm $\|\phi\|_\infty = \sup_{f \in \mathcal{F}} \|\phi(f)\|$. Let us consider a typical situation
where $\mathcal{N}_{\mathrm{metric}}(\Phi, \alpha) = O(1/\alpha^p)$. To upper bound the $\Phi$-regret we can always make the set of adversary's
moves larger. In fact, we make the set $\mathcal{X} = \mathcal{C}_{\mathcal{F}}$, where

$$\mathcal{C}_{\mathcal{F}} = \{x : \mathcal{F} \to \mathbb{R} \;:\; x \text{ convex and 1-Lipschitz}\}.$$

Moreover, by Lemma 42 we have $V_T(\mathcal{C}_{\mathcal{F}}, \mathcal{F}, \Phi) = V_T(\mathcal{L}_{\mathcal{F}}, \mathcal{F}, \Phi)$, where

$$\mathcal{L}_{\mathcal{F}} = \{x : \mathcal{F} \to \mathbb{R} \;:\; x \text{ linear and 1-Lipschitz}\}.$$
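Then the sequential complexity bound is

$$\sup_{(\mathbf{f},\mathbf{x})} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^T \epsilon_t \left\langle \phi(\mathbf{f}_t(\epsilon)), \mathbf{x}_t(\epsilon) \right\rangle . \tag{10}$$

Note that the set $\mathcal{X}$ is now just the set of 1-Lipschitz linear functions, i.e. elements in the unit ball of the
dual space. Since $\|\phi_1 - \phi_2\|_\infty \le \alpha$ implies

$$\left| \langle \phi_1(f), x \rangle - \langle \phi_2(f), x \rangle \right| \le \alpha$$

for any $f \in \mathcal{F}$ and $x \in \mathcal{X}$, we can use metric entropy inside Dudley's integral to upper bound the unnormalized
sequential complexity (that is, $T$ times the quantity in (10)) by

$$c \inf_{\alpha} \left( \alpha T + \sqrt{T} \int_{\alpha}^{1} \sqrt{\mathcal{N}_{\mathrm{metric}}(\Phi, \alpha')}\, d\alpha' \right).$$

This bound behaves as $O(\sqrt{T})$ if $p < 2$, as $O(\sqrt{T}\log T)$ if $p = 2$, and as $O(T^{(p-1)/p})$ if $p > 2$. These are
better than the general bound of $O(T^{(p+1)/(p+2)})$ given in Example 23 of [26].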
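The three regimes can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the metric entropy to be exactly $\alpha^{-p}$, sets the constant $c$ to 1, uses 1 as the upper limit of the entropy integral, and minimizes over a logarithmic grid of $\alpha$. The function names are hypothetical and not part of the paper.

```python
import math

def dudley_bound(T, p, alpha):
    """alpha*T + sqrt(T) * integral from alpha to 1 of a^(-p/2) da (closed form).

    Assumes metric entropy exactly a^(-p), so its square root is a^(-p/2).
    """
    if p == 2:
        integral = math.log(1.0 / alpha)
    else:
        e = 1.0 - p / 2.0
        integral = (1.0 - alpha ** e) / e
    return alpha * T + math.sqrt(T) * integral

def best_bound(T, p, grid=2000):
    """Minimize the bound over a log-spaced grid of alpha in [T^-2, 1]."""
    return min(dudley_bound(T, p, T ** (-2.0 * i / grid)) for i in range(grid + 1))

def empirical_exponent(p, T1=1e4, T2=1e8):
    """Slope of log(bound) against log(T) between two horizons."""
    return math.log(best_bound(T2, p) / best_bound(T1, p)) / math.log(T2 / T1)

if __name__ == "__main__":
    for p in (1, 3, 4):
        predicted = 0.5 if p < 2 else (p - 1) / p
        print(p, round(empirical_exponent(p), 3), predicted)
```

For $p < 2$ the measured slope should be close to $1/2$, and for $p > 2$ close to $(p-1)/p$; the agreement is only approximate at finite $T$ because of lower-order terms.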



5.1.4 Linear Transformations 

In this section we consider the following scenario, discussed in [15]. Suppose $\mathcal{F}$ is a subset of a Hilbert space.
Let $\Phi$ be the set of Lipschitz linear transformations on $\mathcal{F}$, i.e. $\Phi = \{M : \mathcal{F} \to \mathcal{F} \;:\; \|M\| \le R\}$ for some
operator norm $\|\cdot\|$. Let $\|\cdot\|_*$ be dual to $\|\cdot\|$. We are assuming the Online Convex Optimization scenario,
i.e. $\mathcal{X}$ is a set of $L$-Lipschitz real-valued convex functions on $\mathcal{F}$ and the loss is defined as $\ell(f,x) = x(f)$.
Furthermore,

$$\ell_{\phi_M}(f, x) = x(Mf).$$

Therefore, we are in the setting of the well-studied online convex optimization (possibly in an infinite-dimensional
Hilbert space), yet instead of being compared to the value of the best fixed point $f^*$ in hindsight,
the player is being evaluated according to the best linear transformation of his trajectory $f_1, \ldots, f_T$. Is this
problem learnable?

By Lemma 42 the value of the convex game is equal to the value of the associated linear game. Suppose
functions $x \in \mathcal{X}$ have gradients bounded by $L$ in the $\ell_2$ norm. The value of the convex game is upper
bounded by the sequential complexity of the class of linear payoffs $\ell(f, x) = \langle f, x \rangle$. Then the sequential
complexity bound is

$$\sup_{(\mathbf{f},\mathbf{x})} \mathbb{E}_{\epsilon_{1:T}} \sup_{M \in \Phi} \frac{1}{T} \sum_{t=1}^T \epsilon_t \left\langle M \mathbf{f}_t(\epsilon), \mathbf{x}_t(\epsilon) \right\rangle , \tag{11}$$

which can be upper bounded by $R \cdot L \cdot \mathrm{diam}_2(\mathcal{F})$. Note that these results hold in infinite-dimensional Hilbert
spaces, where a metric entropy-type cover of $\mathcal{F}$ would not even be finite.



5.2 Blackwell's Approachability 



Blackwell's Approachability Theorem [?J (201 El is a fundamental result for repeated two-player zero-sum 
games. By means of this Theorem, learnability (Hannan consistency) can be established for a wide array 
of problems, as illustrated in [8]. For instance, existence of calibrated forecasters can be deduced from
Blackwell's Approachability Theorem [22, 11].



Let us first discuss the relation of our results to Blackwell's Theorem. A proof of Blackwell's Theorem (see 
for instance [8]) reveals that (a) martingale convergence has to take place in the payoff space, and (b) the 
so-called Blackwell's one-shot approachability condition has to be satisfied. The former is closely related 
to the first term in our Triplex Inequality, while the latter is related to the second term (ability to play 
well if the next move is known). Interestingly, in the literature Blackwell's Theorem is applied by
embedding the problem at hand into an often high-dimensional space. The dimensionality represents the 
complexity of the problem, but this embedding is often artificial. In contrast, the problem complexity is 
captured by the third term of our decomposition, the sequential complexity, and it is explicitly written as a 
complexity measure rather than an embedding into some other space. The ability to upper bound problem 
complexity with tools similar to those developed in [25] (e.g. covering numbers) means that learnability can 
be established for a wide class of problems. 

In this section we show that Blackwell's approachability can be viewed as an online game with a particular 
performance measure (distance to the set). Using the techniques developed in this paper, we prove Blackwell's 
approachability in Banach spaces for which martingale convergence holds (Theorem 22). We also show that
martingale convergence is necessary for the result to hold (Theorem 24). To the best of our knowledge, both
of these results are novel.

To define the problem precisely, suppose $\mathcal{H}$ is a subset of a Banach space $\mathcal{B}$ and $S \subseteq \mathcal{B}$ is a closed convex set.
For the moves $f \in \mathcal{F}$ of the player and $x \in \mathcal{X}$ of the adversary, $\ell(f, x) \in \mathcal{H}$ is a Banach space valued signal.
The goal of the player is to keep the average of the signals $\frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t)$ close to the set $S$. To view this
problem as an instance of our general framework, define

$$B(z_1, \ldots, z_T) = \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T z_t - c \right\| .$$

The comparator term is zero by our assumption that $\Phi_T$ contains sequences $(\phi_1, \ldots, \phi_T)$ of constant mappings
which transform our actions to a point inside $S$: $\ell_{\phi_t}(f, x) = c_t \in S$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, and $1 \le t \le T$.
Thus, indeed, the performance measure is

$$R_T = \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - c \right\| ,$$

the distance to the set $S$. The next condition on the payoff $\ell$ says that the player must be able to choose a
"good" mixed strategy $q$ in response to a given mixed strategy $p$ of the adversary. This strategy $q$ should,
on average, put the payoff inside the set $S$. Recall that $\ell(q, p)$ is simply a short-hand for the expected payoff
$\mathbb{E}_{f \sim q,\, x \sim p}\, \ell(f, x)$ (that is, we do not make any assumptions about linearity of $\ell$).



Definition 13. Given a set $S$, the Blackwell's approachability game is said to be one shot approachable if
for every mixed strategy $p$ of the adversary, there exists a mixed strategy $q$ for the player such that $\ell(q, p) \in S$.

Blackwell's one-shot approachability condition is akin to the second term in the Triplex Inequality, where the
order of who plays first is switched. If the one-shot condition is satisfied, it remains to check martingale
convergence.

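Definition 14. We will say that martingale convergence holds if

$$\lim_{T \to \infty} \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| = 0,$$

where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb{N}}$ such that each
$d_t \in \mathrm{conv}(\mathcal{H} \cup -\mathcal{H})$.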

We now show that, under the one-shot approachability condition, the set is approachable whenever martingale 
convergence holds in the subset of the Banach space. 

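Theorem 22. For any game that is one shot approachable, we have that

$$V_T(\ell, \Phi_T) \le 4 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| ,$$

where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb{N}}$ such that each
$d_t \in \mathrm{conv}(\mathcal{H} \cup -\mathcal{H})$.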

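Proof. We apply Theorem 1 to the Blackwell Approachability game. Note that for any sequence
$(\phi_1, \ldots, \phi_T)$, $\phi_t$ maps the payoff to some element of $S$. Hence, $B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) = 0$ for
any $f_1, \ldots, f_T \in \mathcal{F}$, $x_1, \ldots, x_T \in \mathcal{X}$. We then conclude that

$$
\begin{aligned}
V_T(\ell, \Phi_T) \le{}& \sup_{p_1, q_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \left\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \mathbb{E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} B(\ell(f'_1, x'_1), \ldots, \ell(f'_T, x'_T)) \right\} \\
&+ \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) .
\end{aligned}
\tag{12}
$$

We remark that for the upper bound to hold it is enough to assume that $\Phi_T$ contains some sequence that maps
the payoffs to some element of $S$.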

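Consider the two terms in the above bound separately. The first term can be written as

$$
\begin{aligned}
&\sup_{p_1, q_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \mathbb{E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \left\{ \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - c \right\| - \inf_{c' \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f'_t, x'_t) - c' \right\| \right\} \\
&\quad \le \sup_{p_1, q_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \mathbb{E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(f'_t, x'_t) \right\| \\
&\quad \le \sup_{p_1, q_1} \mathbb{E}_{\substack{f_1, f'_1 \sim q_1 \\ x_1, x'_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T, f'_T \sim q_T \\ x_T, x'_T \sim p_T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(f'_t, x'_t) \right\| ,
\end{aligned}
$$

where in the first inequality we used $\inf_a [C_1(a)] - \inf_a [C_2(a)] \le \sup_a [C_1(a) - C_2(a)]$ along with a triangle
inequality. This is now bounded by

$$2 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| ,$$

where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb{N}}$ such that each
$d_t \in \mathrm{conv}(\mathcal{H} \cup -\mathcal{H})$.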



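The second term in Eq. (12) is

$$
\begin{aligned}
&\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) \\
&\quad = \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - c \right\| \\
&\quad \le \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \inf_{c \in S} \left\{ \left\| \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) - c \right\| + \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) \right\| \right\} \\
&\quad \le \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \left\{ \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) - c \right\| + \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) \right\| \right\} \\
&\quad \le \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \left\{ \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) - c \right\| \right\} + \sup_{p_1, q_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) \right\| ,
\end{aligned}
\tag{13}
$$

where the last inequality uses the fact that the supremum is convex and the infimum satisfies the following property:
$\inf_a [C_1(a) + C_2(a)] \le [\inf_a C_1(a)] + [\sup_a C_2(a)]$. By the one shot approachability assumption, we can choose
a particular response $q_t$ (in the first term of Eq. (13)) for a given $p_t$ to be the mixed strategy that satisfies
$\ell(q_t, p_t) \in S$. Since $S$ is a convex set, we conclude that

$$\frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) \in S,$$

and the first term in Eq. (13) is zero. The second term is trivially upper bounded as

$$
\begin{aligned}
\sup_{p_1, q_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) \right\|
&\le \sup_{p_1, q_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T} \sum_{t=1}^T \ell(q_t, p_t) \right\| \\
&\le 2 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| .
\end{aligned}
$$

Combining the two upper bounds yields the desired result. □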

We now discuss lower bounds on the value of Blackwell's approachability game. The first lower bound is
straightforward.

Proposition 23. Suppose martingale convergence holds. For any Blackwell's approachability game to have
vanishing regret, one shot approachability for the game is a necessary condition.

We now show that martingale convergence in the space of payoffs is necessary for Blackwell's approachability.
To the best of our knowledge, this result has not appeared in the literature.

Theorem 24. For every symmetric convex set $\mathcal{H}$, there exists a one shot approachable game with payoffs
mapping to $\mathcal{H}$ such that

$$V_T \ge \frac{1}{2} \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| ,$$

where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb{N}}$ such that each $d_t \in \mathcal{H}$.



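Proof. Consider the game where the adversary plays from the set $\mathcal{X} = \mathcal{H}$, the player plays from the set $\mathcal{F} = \{\pm 1\}$, and
$S = \{0\}$. Suppose the payoff is given by $\ell(f, x) = f \cdot x$. Now consider the adversary strategy where the adversary
fixes an $\mathcal{H}$-valued tree $\mathbf{x}$ and at each time $t$ picks a random $\epsilon_t \in \{\pm 1\}$ and plays $x_t = \epsilon_t \mathbf{x}_t(f_1 \cdot \epsilon_1, \ldots, f_{t-1} \cdot \epsilon_{t-1})$,
that is, a random sign multiplied by the instance given by the path on the tree specified by $f_1 \cdot \epsilon_1, \ldots, f_{t-1} \cdot \epsilon_{t-1}$.
Further note that since $\epsilon_t \in \{\pm 1\}$ are Rademacher random variables, irrespective of the choice
of distribution from which $f_t$ is drawn, $f_t \cdot \epsilon_t$ is a Rademacher random variable conditioned on history. This
shows that for the above prescribed adversary strategy, for any $\mathcal{X}$-valued tree $\mathbf{x}$ and any two
player strategies $\{q_t^*\}$ and $\{q_t\}$, we have

$$
\begin{aligned}
&\mathbb{E}_{\epsilon_1 \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_1 \sim q_1^*} \cdots \mathbb{E}_{\epsilon_T \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_T \sim q_T^*} \left\| \frac{1}{T} \sum_{t=1}^T (f_t \cdot \epsilon_t)\, \mathbf{x}_t(f_1 \cdot \epsilon_1, \ldots, f_{t-1} \cdot \epsilon_{t-1}) \right\| \\
&\quad = \mathbb{E}_{\epsilon_1 \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_1 \sim q_1^*} \cdots \mathbb{E}_{\epsilon_{T-1} \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_{T-1} \sim q_{T-1}^*} \mathbb{E}_{\epsilon_T \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_T \sim q_T} \left\| \frac{1}{T} \sum_{t=1}^T (f_t \cdot \epsilon_t)\, \mathbf{x}_t(f_1 \cdot \epsilon_1, \ldots, f_{t-1} \cdot \epsilon_{t-1}) \right\| \\
&\quad = \cdots = \mathbb{E}_{\epsilon_1 \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_1 \sim q_1} \cdots \mathbb{E}_{\epsilon_T \sim \mathrm{Unif}\{\pm 1\}} \mathbb{E}_{f_T \sim q_T} \left\| \frac{1}{T} \sum_{t=1}^T (f_t \cdot \epsilon_t)\, \mathbf{x}_t(f_1 \cdot \epsilon_1, \ldots, f_{t-1} \cdot \epsilon_{t-1}) \right\| .
\end{aligned}
$$

The first equality above is due to the fact that $f_T \cdot \epsilon_T$ is a Rademacher random variable conditioned on
$f_1, \ldots, f_{T-1}$ and $\epsilon_1, \ldots, \epsilon_{T-1}$, which means we can replace $q_T^*$ with $q_T$. The subsequent equalities are obtained
similarly by replacing each $q_t^*$ by $q_t$ one by one, inside out, by conditioning on $f_1, \ldots, f_{t-1}$ and $\epsilon_1, \ldots, \epsilon_{t-1}$.
Hence we see that the adversary strategy is an equalizer strategy. Hence, using
Proposition 16 and picking the fixed $f = 1$, we see that

$$V_T \ge \sup_{\mathbf{x}} \mathbb{E}_{\epsilon \sim \mathrm{Unif}\{\pm 1\}^T} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{x}_t(\epsilon) \right\| \ge \frac{1}{2} \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| ,$$

where the last inequality holds because the worst-case martingale difference sequences generated by random signs
(Walsh-Paley martingales) are within a factor of at most two of the worst-case general martingale difference
sequences [?]. □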
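The equalizer property used in this proof can be verified exactly on a small instance. The sketch below is an illustration under assumptions that are not in the paper: payoffs are real numbers in $[-1, 1]$ (so the norm is the absolute value), the tree `tree_value` is an arbitrary hypothetical choice, and a player strategy is encoded as a function `prob_plus` returning the probability of playing $+1$ given the past moves and signs. It enumerates all sign and move sequences and computes the expected distance to $S = \{0\}$.

```python
import itertools

def tree_value(path):
    """Hypothetical [-1, 1]-valued tree: the node value depends on the sign path so far."""
    raw = sum((i + 1) * s for i, s in enumerate(path)) * 37 + len(path) * 11
    return (raw % 21 - 10) / 10.0

def expected_distance(T, prob_plus):
    """Exact E |(1/T) sum_t (f_t eps_t) x_t(f_1 eps_1, ..., f_{t-1} eps_{t-1})|.

    prob_plus(past_f, past_eps) is the probability that the player plays +1.
    """
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=T):
        for f in itertools.product((-1, 1), repeat=T):
            prob = 1.0
            path = []
            s = 0.0
            for t in range(T):
                q = prob_plus(f[:t], eps[:t])
                # eps_t is uniform; f_t is drawn from the player's mixed strategy
                prob *= 0.5 * (q if f[t] == 1 else 1.0 - q)
                sign = f[t] * eps[t]
                s += sign * tree_value(tuple(path))
                path.append(sign)
            total += prob * abs(s) / T
    return total

# Three player strategies: deterministic, uniform, and history-dependent.
always_plus = lambda past_f, past_eps: 1.0
uniform = lambda past_f, past_eps: 0.5
adaptive = lambda past_f, past_eps: 0.9 if sum(past_eps) > 0 else 0.2

if __name__ == "__main__":
    for name, strat in (("always +1", always_plus), ("uniform", uniform), ("adaptive", adaptive)):
        print(name, expected_distance(4, strat))
```

All three strategies should yield the same expected distance (up to floating-point rounding), since the products $f_t \epsilon_t$ are uniform signs whatever the player does.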



5.3 Calibration 

Calibration, introduced by Brier [7] and Dawid [5], is an important notion for forecasting binary sequences.
In the context of weather forecasting, calibration means that, for the days the forecaster announced "30%
chance of rain", the empirical frequency of rain should indeed be close to 30% [5, p. 85]; moreover, this has
to hold for any forecasted value. The existence of calibrated forecasters, a fact which is not obvious a priori,
was shown by Foster and Vohra [12]. Following [8], we consider the notion of $\lambda$-calibration. If a forecaster
is $\lambda$-calibrated for all $\lambda > 0$, we say that the forecaster is well calibrated.

In what follows, we formulate the calibration problem of forecasting $\{1, \ldots, k\}$-valued sequences in our
general framework. In particular, we are interested in sharp rates on the resulting value of the calibration
game, and we will compare our results with the recent work of Mannor and Stoltz [22].

Fix a norm $\|\cdot\|$ on $\mathbb{R}^k$. Let $\mathcal{H} = \mathbb{R}^k$, $\mathcal{F} = \Delta(k)$, and $\mathcal{X}$ the set of standard unit vectors in $\mathbb{R}^k$ (vertices of
$\Delta(k)$). Define $\ell(f, x) = 0$; that is, the forecaster is penalized only through the comparator term. We define
$B(z_1, \ldots, z_T) = -\left\| \frac{1}{T} \sum_{t=1}^T z_t \right\|$. Define $\Phi_T = \{(\phi_{p,\lambda}, \ldots, \phi_{p,\lambda}) : p \in \Delta(k), \lambda > 0\}$ to contain time-invariant
mappings defined by

$$\ell_{\phi_{p,\lambda}}(f, x) = \mathbf{1}\{\|f - p\| \le \lambda\} \cdot (f - x) .$$

This definition of the loss is indeed natural for the $\lambda$-calibration problem. It says that, for any $p$ chosen
after the game, if we consider a round when the player predicted $f \in \Delta(k)$ close to $p$, the loss should be the
difference between the actual outcome $x$ and $f$. Indeed, when we put all the definitions together, we obtain

$$R_T = \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|f_t - p\| \le \lambda\} \cdot (f_t - x_t) \right\| .$$
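To make the performance measure $R_T$ concrete, the following sketch evaluates it for a given sequence of forecasts and outcomes in the $\ell_1$ norm. It is an illustration only: the supremum over $p \in \Delta(k)$ and $\lambda > 0$ is replaced by a maximum over user-supplied finite lists `centers` and `radii` (hypothetical names), so the returned number is in general a lower bound on $R_T$.

```python
def l1(v):
    """l1 norm of a vector given as a sequence of floats."""
    return sum(abs(c) for c in v)

def calibration_score(forecasts, outcomes, centers, radii):
    """Max over (p, lam) in the grid of
    || (1/T) sum_t 1{ ||f_t - p||_1 <= lam } (f_t - x_t) ||_1 .
    """
    T = len(forecasts)
    k = len(forecasts[0])
    best = 0.0
    for p in centers:
        for lam in radii:
            acc = [0.0] * k
            for f, x in zip(forecasts, outcomes):
                # only rounds whose forecast falls in the ball around p contribute
                if l1([fi - pi for fi, pi in zip(f, p)]) <= lam:
                    for i in range(k):
                        acc[i] += f[i] - x[i]
            best = max(best, l1(acc) / T)
    return best

if __name__ == "__main__":
    grid_centers = [(0.5, 0.5), (1.0, 0.0)]
    grid_radii = [0.1, 2.0]
    # Forecasting 50/50 on an alternating sequence is perfectly calibrated.
    good = calibration_score([(0.5, 0.5)] * 4,
                             [(1, 0), (0, 1), (1, 0), (0, 1)],
                             grid_centers, grid_radii)
    # Always forecasting outcome 1 while outcome 2 occurs is maximally miscalibrated.
    bad = calibration_score([(1.0, 0.0)] * 4, [(0, 1)] * 4,
                            grid_centers, grid_radii)
    print(good, bad)
```

The first forecaster scores 0 on this grid and the second scores 2, the largest possible $\ell_1$ value for $k = 2$.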



Note that this notion of regret allows the worst scale $\lambda$ to be chosen at the end of the game. This makes
it a stronger requirement than what is required for building a well calibrated forecaster. Nevertheless, we
can bound the value of this game, improving on the results of Mannor and Stoltz [22]. Theorem 25 shows
that the rate of calibration is $O(T^{-1/3})$ no matter what $k$ is. The rate of $O(T^{-1/3})$ has been established for
$k = 2$ previously. For $k > 2$, however, the best rates known to us (due to [22]) deteriorate with $k$. Let us
remark that some looseness of the approach of [22] comes from discretization in order to phrase the problem
as Blackwell's approachability. A reader will note that we also pass to a discretization in the proof below.
However, this is done late in the analysis in order to upper bound the sequential complexity. This seems
to speak in favor of our approach, aimed at directly looking at the complexity of the problem through the
notion of sequential complexity.

Theorem 25. For the calibration game with $k$ outcomes and with the $\ell_1$ norm, we have that for $T \ge 3$ and
some absolute constant $c$,

$$V_T(\ell, \Phi_T) \le c\, k^2 \left( \frac{\log T}{T} \right)^{1/2} .$$



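Proof. Let $\delta > 0$, to be determined later. Let $\|\cdot\|$ denote the $\ell_1$ norm. Let $C_\delta$ be the maximal $2\delta$-packing
of $\Delta(k)$ in this norm. Consider the calibration game defined in Example 4 augmented with the restriction
that the player's choice belongs to $C_\delta$ instead of $\Delta(k)$. The corresponding minimax expression with this
restriction is clearly an upper bound on the value of the game defined in Example 4.

Observe that the first term in the Triplex Inequality of Theorem 1 is zero. The second term is upper bounded
by a particular (sub)optimal response $q_t$ being the point mass on $p_t^\delta$, the element of $C_\delta$ closest to $p_t$. Note
that any maximal $2\delta$-packing is also a $2\delta$-cover. Thus, the second term becomes

$$
\begin{aligned}
&\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{\phi \in \Phi_T} \mathbb{E}_{\substack{x_{1:T} \sim p_{1:T} \\ f_{1:T} \sim q_{1:T}}} \left\{ -B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \right\} \\
&\quad = \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \mathbb{E}_{\substack{x_{1:T} \sim p_{1:T} \\ f_{1:T} \sim q_{1:T}}} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|f_t - p\| \le \lambda\} \cdot (f_t - x_t) \right\| \\
&\quad \le \sup_{p_1} \cdots \sup_{p_T} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|p_t^\delta - p\| \le \lambda\} \cdot (p_t^\delta - x_t) \right\| ,
\end{aligned}
$$

which, in turn, is upper bounded via the triangle inequality by

$$
\begin{aligned}
&\sup_{p_1} \cdots \sup_{p_T} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|p_t^\delta - p\| \le \lambda\} \cdot (p_t^\delta - p_t) \right\| \\
&\qquad + \sup_{p_1} \cdots \sup_{p_T} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|p_t^\delta - p\| \le \lambda\} \cdot (p_t - x_t) \right\| \\
&\quad \le 2\delta + \sup_{p_1} \cdots \sup_{p_T} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|p_t^\delta - p\| \le \lambda\} \cdot (p_t - x_t) \right\| .
\end{aligned}
$$

Now note that for a given $\lambda > 0$, $p_1, \ldots, p_T$ and $p \in \Delta(k)$, the sequence $\{\mathbf{1}\{\|p_t^\delta - p\| \le \lambda\} \cdot (p_t - x_t)\}_{t \in \mathbb{N}}$ is
a martingale difference sequence, and so the second term in the Triplex Inequality is bounded as

$$\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{\phi \in \Phi_T} \mathbb{E}_{\substack{x_{1:T} \sim p_{1:T} \\ f_{1:T} \sim q_{1:T}}} \left\{ -B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \right\} \le 2\delta + 2\sqrt{\frac{k}{T}} . \tag{14}$$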

We now proceed to upper bound the third term in the Triplex Inequality. Since $-B$ is subadditive, by
Theorem 2 we have that the third term is bounded by twice the sequential complexity

$$
\begin{aligned}
2\mathfrak{R}_T(\ell, \Phi_T, -B) &= 2 \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi_T} -B\left(\epsilon_1 \ell_{\phi_1}(\mathbf{f}_1(\epsilon), \mathbf{x}_1(\epsilon)), \ldots, \epsilon_T \ell_{\phi_T}(\mathbf{f}_T(\epsilon), \mathbf{x}_T(\epsilon))\right) \\
&= 2 \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\| \le \lambda\} \cdot (\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \right\| ,
\end{aligned}
$$

where $\mathbf{f}$ is a $C_\delta$-valued tree. Using the fact that $\mathbf{f}$ is a discrete-valued tree, not a $\Delta(k)$-valued tree, we would
like to pass from the supremum over $\lambda > 0$ and $p \in \Delta(k)$ to a supremum over a finite discrete set in order to
appeal to Proposition 6.

To this end, fix $\mathbf{f}$, $\mathbf{x}$ and $\epsilon_{1:T}$, and let us see how many genuinely different functions we can get by varying
$\lambda > 0$ and $p \in \Delta(k)$. This question boils down to looking at the size of the class

$$\mathcal{G} := \{ g_{p,\lambda}(f) = \mathbf{1}\{\|f - p\| \le \lambda\} \;:\; p \in \Delta(k), \lambda > 0 \}$$

over the possible values of $f \in C_\delta$. Indeed, if $g_{p,\lambda}(f) = g_{p',\lambda'}(f)$ for all $f \in C_\delta$, then

$$\frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\| \le \lambda\} \cdot (\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) = \frac{1}{T} \sum_{t=1}^T \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p'\| \le \lambda'\} \cdot (\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) .$$

We appeal to VC theory for bounding the size of $\mathcal{G}$ over $C_\delta$. First, we claim that the VC dimension of $\mathcal{G}$
is $O(k^2)$. Note that $\mathcal{G}$ is the class of indicators over $\ell_1$ balls of radius $\lambda$ centered at $p$ for various values of
$p, \lambda$. A result of Goldberg and Jerrum [14] states that for a class $\mathcal{G}$ of functions parametrized by a vector
of length $d$, if for $g \in \mathcal{G}$ and $f \in \mathcal{F}$, $\mathbf{1}\{g(f) = 1\}$ can be computed using $m$ arithmetic operations, the VC
dimension of $\mathcal{G}$ is $O(md)$. In our case, the functions in $\mathcal{G}$ are parametrized by $k$ values and membership
$\|f - p\|_1 \le \lambda$ can be established in $O(k)$ operations. This yields an $O(k^2)$ bound on the VC dimension of $\mathcal{G}$. By
the Sauer-Shelah Lemma, the number of different labelings of the set $C_\delta$ by $\mathcal{G}$ is bounded by $|C_\delta|^{c \cdot k^2}$ for some
absolute constant $c$. We conclude that the effective number of different $(p, \lambda)$ is finite. Let us remark that
the VC upper bound is not used in place of the sequential Littlestone's dimension. It is only used to show
that the set $\Phi_T$ is effectively finite, and such a technique can be useful when the set of player's actions is finite.

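Hence, there exists a finite set $\mathcal{S}$ of pairs $(\lambda, p)$ with cardinality $|\mathcal{S}| \le |C_\delta|^{c \cdot k^2}$ such that

$$
\begin{aligned}
2\mathfrak{R}_T(\ell, \Phi_T, -B) &\le 2 \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\|_1 \le \lambda\} \cdot (\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \right\|_1 \\
&= 2 \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \max_{(p, \lambda) \in \mathcal{S}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\|_1 \le \lambda\} \cdot (\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \right\|_1 \\
&\le 2 k^{1/2} \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \max_{(p, \lambda) \in \mathcal{S}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\|_1 \le \lambda\} \cdot (\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \right\|_2 .
\end{aligned}
$$

Now note that $\|\cdot\|_2^2$ is $(2, 2)$-smooth, and so applying Lemma 8 with $G = \|\cdot\|_2$, $\gamma = 2$, $\eta = 2$, we see that

$$2\mathfrak{R}_T(\ell, \Phi_T, -B) \le 2 k^{1/2} \left( \frac{16\, c\, k^2 \log(|C_\delta|)}{T} \right)^{1/2} \le c' k^{3/2} \left( \frac{\log(|C_\delta|)}{T} \right)^{1/2}$$

for some small absolute constant $c'$.

Now note that the size of the set $C_\delta$, the $2\delta$-packing of $\Delta(k)$, is upper bounded by the size of the minimal $\delta$-cover
of $\Delta(k)$, which can be bounded as $|C_\delta| \le (c''/\delta)^{k-1}$ for an absolute constant $c''$, and so we see that

$$2\mathfrak{R}_T(\ell, \Phi_T, -B) \le c' k^2 \left( \frac{\log(1/\delta)}{T} \right)^{1/2} .$$

Combining the above upper bound on the third term of the Triplex Inequality with Equation (14), which bounds the
second term of the Triplex Inequality (and since the first term is 0), we see that

$$V_T \le 2\delta + 2\sqrt{\frac{k}{T}} + c' k^2 \left( \frac{\log(1/\delta)}{T} \right)^{1/2} .$$

Choosing $\delta = 1/T$ concludes the proof. □

5.4 Other Examples

5.4.1 External Regret with Global Costs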



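Let us consider a more general setting where the (vector) loss is $\ell(f, x)$ rather than the specific choice $f \odot x$
in Example 5. The Triplex Inequality and Theorem 2 then give

$$
\begin{aligned}
V_T \le{}& \sup_{p_1, q_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \mathbb{E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \left\{ \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) \right\| - \left\| \frac{1}{T} \sum_{t=1}^T \ell(f'_t, x'_t) \right\| \right\} \\
&+ \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{f \in \mathcal{F}} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\{ \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) \right\| - \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, x_t) \right\| \right\} \\
&+ 2 \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \ell(f, \mathbf{x}_t(\epsilon)) \right\| .
\end{aligned}
$$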



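Consider the first term in the Triplex Inequality. Observe that $(\ell(f_t, x_t) - \ell(f'_t, x'_t))_{t=1}^T$ is a (vector valued)
martingale difference sequence, and so the first term is at most

$$\sup_{p_1, q_1} \mathbb{E}_{\substack{f_1, f'_1 \sim q_1 \\ x_1, x'_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathbb{E}_{\substack{f_T, f'_T \sim q_T \\ x_T, x'_T \sim p_T}} \left\| \frac{1}{T} \sum_{t=1}^T \left( \ell(f_t, x_t) - \ell(f'_t, x'_t) \right) \right\| \le 2 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| ,$$

where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb{N}}$ such that each
$d_t \in \mathrm{conv}(\mathcal{H} \cup -\mathcal{H})$.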

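Now, consider the second summand above:

$$
\begin{aligned}
&\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{f \in \mathcal{F}} \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\{ \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) \right\| - \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, x_t) \right\| \right\} \\
&\quad = \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \left\{ \mathbb{E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{f \in \mathcal{F}} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, x_t) \right\| \right\} \\
&\quad \le \sup_{p_1} \cdots \sup_{p_T} \left\{ \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{f \in \mathcal{F}} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, x_t) \right\| \right\} ,
\end{aligned}
$$

where in the last step a (sub)optimal choice was made for $q_t$: the distribution $q_t = \delta_{f_t}$ puts all the mass on
$f_t$ such that

$$\|\ell(f_t, p_t)\| = \inf_{f \in \mathcal{F}} \|\ell(f, p_t)\| .$$

Observe that by several applications of the triangle and Jensen's inequalities,

$$
\begin{aligned}
&\mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{f \in \mathcal{F}} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, x_t) \right\| \\
&\quad \le \left\{ \left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, p_t) \right\| - \inf_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, p_t) \right\| \right\} + \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \left( \ell(f_t, x_t) - \ell(f_t, p_t) \right) \right\| .
\end{aligned}
\tag{15}
$$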



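Now we make an important assumption.

Assumption 1. Suppose that, for any $p_1, p_2$,

$$\inf_{f \in \mathcal{F}} \|\ell(f, p_1) + \ell(f, p_2)\| \ge \inf_{f \in \mathcal{F}} \|\ell(f, p_1)\| + \inf_{f \in \mathcal{F}} \|\ell(f, p_2)\| .$$

Under Assumption 1, along with the way we chose $f_t$, the first term in (15) becomes

$$\left\| \frac{1}{T} \sum_{t=1}^T \ell(f_t, p_t) \right\| - \inf_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^T \ell(f, p_t) \right\| \le \frac{1}{T} \sum_{t=1}^T \|\ell(f_t, p_t)\| - \frac{1}{T} \sum_{t=1}^T \inf_{f \in \mathcal{F}} \|\ell(f, p_t)\| = 0 .$$

We conclude that the second term in the Triplex Inequality can be upper bounded by

$$\sup_{p_1} \cdots \sup_{p_T} \mathbb{E}_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \left( \ell(f_t, x_t) - \ell(f_t, p_t) \right) \right\| ,$$

which, in turn, is no worse than the supremum over distributions $M$ of martingale difference sequences used
to bound the first term.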



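This gives us the general upper bound on the value of the game:

$$V_T \le 4 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| + 2 \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \ell(f, \mathbf{x}_t(\epsilon)) \right\| . \tag{16}$$

Let us see what this implies in a specific case of interest.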



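Global Cost Learning on the Simplex. Here we consider Example 5, the setting studied in Even-Dar
et al [10]. Let $\mathcal{F} = \Delta(k)$, $\mathcal{X} = [0, 1]^k$ and $\ell(f, x) = f \odot x$. Let us first verify if Assumption 1 holds here. By
linearity of the vector loss, we just have to verify whether, for arbitrary $p_1, p_2$, we have

$$\inf_{g \in \Delta(k)} \| g \odot \bar{p}_1 + g \odot \bar{p}_2 \| \ge \inf_{g \in \Delta(k)} \| g \odot \bar{p}_1 \| + \inf_{g \in \Delta(k)} \| g \odot \bar{p}_2 \| ,$$

where the notation $\bar{p}_i$ stands for the mean of the distribution $p_i$. This is equivalent to asking whether the
function

$$x \mapsto \inf_{f \in \Delta(k)} \| f \odot x \|$$

is concave. Lemma 41 in the appendix proves that it is. Note that in [10], it is shown that the above function
is concave for the $\ell_p$ norms (including $p = \infty$). It turns out that it remains concave no matter what norm is
chosen. Thus, the general upper bound (16) holds. In the case we are considering, we can further massage
the second term in that upper bound. Note that for any $f$ and $y$, $\|f \odot y\| \le \|f\|_\infty \|y\| \le \|y\|$. Hence, we
have

$$\sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \left( f \odot \mathbf{x}_t(\epsilon) \right) \right\| = \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \left\| f \odot \left( \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{x}_t(\epsilon) \right) \right\| \le \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{x}_t(\epsilon) \right\| .$$

Hence, using the above in (16), we see that

$$V_T \le 4 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| + 2 \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \left\| \frac{1}{T} \sum_{t=1}^T \epsilon_t \mathbf{x}_t(\epsilon) \right\| \le 6 \sup_{M} \mathbb{E} \left\| \frac{1}{T} \sum_{t=1}^T d_t \right\| ,$$

where the last inequality holds because $(\epsilon_t \mathbf{x}_t(\epsilon))_{t=1}^T$ is a martingale difference sequence. In the last inequality the
supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb{N}}$ such that each $d_t \in [-1, 1]^k$.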
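The superadditivity required by Assumption 1 in this example can be spot-checked for two concrete norms where the infimum over the simplex has a closed form. The sketch below is illustrative and rests on two elementary facts that are not stated in the paper: for the $\ell_1$ norm and nonnegative $x$, $\inf_{f \in \Delta(k)} \|f \odot x\|_1 = \min_i x_i$ (a linear function is minimized at a vertex), and for the $\ell_\infty$ norm, $\inf_{f \in \Delta(k)} \|f \odot x\|_\infty = 1 / \sum_i (1/x_i)$, obtained by equalizing the products $f_i x_i$. The function names are hypothetical.

```python
import random

def g_l1(x):
    """inf over the simplex of sum_i f_i x_i for x >= 0, attained at a vertex."""
    return min(x)

def g_linf(x):
    """inf over the simplex of max_i f_i x_i for x >= 0.

    Equalizing f_i x_i gives 1 / sum_i (1 / x_i); the value is 0 if some x_i is 0.
    """
    if min(x) == 0.0:
        return 0.0
    return 1.0 / sum(1.0 / xi for xi in x)

def is_superadditive(g, k=4, trials=1000, seed=0):
    """Check g(x + y) >= g(x) + g(y) on random points of [0, 1)^k."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.random() for _ in range(k)]
        y = [rng.random() for _ in range(k)]
        z = [a + b for a, b in zip(x, y)]
        if g(z) < g(x) + g(y) - 1e-12:
            return False
    return True

if __name__ == "__main__":
    print(is_superadditive(g_l1), is_superadditive(g_linf))
```

A random search of this kind cannot prove concavity; it only fails to find a counterexample, which is consistent with the lemma cited above.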
5.4.2 Adaptive Regret



To study online learning in changing environments, Hazan and Seshadhri defined the notion of adaptive regret
in [17]. The notion of adaptive regret introduced in [17] was mainly one where the cumulative loss for any time
interval is compared to the best predictor in hindsight for that particular interval. We first extend the notion
of adaptive regret in [17] to include departure mappings as

$$R_T := \sup_{[r,s] \subseteq [T]} \left\{ \frac{1}{T} \sum_{t=r}^{s} \mathrm{loss}(f_t, x_t) - \inf_{\psi \in \Psi} \frac{1}{T} \sum_{t=r}^{s} \mathrm{loss}(\psi \circ f_t, x_t) \right\} \tag{17}$$

where $\mathrm{loss} : \mathcal{F} \times \mathcal{X} \to [0, 1]$ is some arbitrary loss function and $\Psi$ is some class of departure mappings. The
key idea in the above definition of regret is that we consider the worst time interval and consider the regret
for that time interval versus some fixed set of departure mappings.

We capture the above notion of regret in our framework by defining:

• $\ell(f, x) = 0$ for all $f \in \mathcal{F}$ and $x \in \mathcal{X}$

• Define the set of time-invariant payoff transformations $\Phi_T = \mathcal{I}_T \times \Psi_T$, where $\Psi_T = \{(\psi, \ldots, \psi) : \psi \in \Psi\}$
and $\Psi$ is some class of departure mappings, and $\mathcal{I}_T = \{([r, s], \ldots, [r, s]) : [r, s] \subseteq [T]\}$, the set of all
intervals in $[T]$ repeated $T$ times.

• For each $t \in [T]$ and $\phi_t = (I_t, \psi_t)$, define $\ell_{\phi_t}(f, x) = \left( -\mathrm{loss}(f, x) + \mathrm{loss}(\psi_t \circ f, x) \right) \mathbf{1}\{t \in I_t\}$

• $B(z_1, \ldots, z_T) = \sum_{t=1}^T z_t / T$



Note that

$$
\begin{aligned}
R_T &= \sup_{[r,s] \subseteq [T]} \left\{ \frac{1}{T} \sum_{t=r}^{s} \mathrm{loss}(f_t, x_t) - \inf_{\psi \in \Psi} \frac{1}{T} \sum_{t=r}^{s} \mathrm{loss}(\psi \circ f_t, x_t) \right\} \\
&= \sup_{I \in \mathcal{I}_T,\, \psi \in \Psi_T} \left\{ \frac{1}{T} \sum_{t=1}^T \mathrm{loss}(f_t, x_t) \mathbf{1}\{t \in I_t\} - \frac{1}{T} \sum_{t=1}^T \mathrm{loss}(\psi_t \circ f_t, x_t) \mathbf{1}\{t \in I_t\} \right\} \\
&= B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \inf_{\phi \in \Phi_T} B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) ,
\end{aligned}
$$

and thus we see that the adaptive regret defined in Equation (17) falls under our general framework. We
would like to point out as an example that if we take $\Psi_T = \{(f, \ldots, f) : f \in \mathcal{F}\}$, the time invariant set of
constant mappings, then the regret defined in Equation (17) is identical to the one in [17]. Below we show a
bound on the value of the game with adaptive regret in terms of the covering number of the departure mapping
class.
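The definition in Equation (17) can be evaluated directly for a finite class of departure mappings. The sketch below is a brute-force illustration with hypothetical names (`adaptive_regret`, `mappings`), using binary actions and the absolute loss purely as an example; it loops over all intervals $[r, s]$ and all mappings.

```python
def adaptive_regret(actions, outcomes, loss, mappings):
    """Worst interval regret, normalized by T, against a finite class of mappings.

    Implements sup over [r, s] of (1/T) * (sum_{t=r}^{s} loss(f_t, x_t)
    - min over psi of sum_{t=r}^{s} loss(psi(f_t), x_t)).
    """
    T = len(actions)
    worst = float("-inf")
    for r in range(T):
        for s in range(r, T):
            player = sum(loss(actions[t], outcomes[t]) for t in range(r, s + 1))
            best = min(
                sum(loss(psi(actions[t]), outcomes[t]) for t in range(r, s + 1))
                for psi in mappings
            )
            worst = max(worst, (player - best) / T)
    return worst

if __name__ == "__main__":
    absolute_loss = lambda f, x: abs(f - x)
    constants = [lambda f: 0, lambda f: 1]  # constant departure mappings
    # The environment switches halfway; the player keeps predicting 0.
    print(adaptive_regret([0, 0, 0, 0], [0, 0, 1, 1], absolute_loss, constants))
```

On this example the regret over the whole horizon is zero, since no single constant beats the player, while the adaptive regret is 0.5 because of the interval covering the last two rounds.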

Theorem 26. For the adaptive regret game we have that 



V T < 8 inf L + 6V2 f ] JW^m d5 + 96 ^1 (18) 



6 High Probability Bounds 

The definition of the value of the game given earlier only guarantees the existence of a randomized
algorithm which, in expectation over its randomization, achieves regret bounded by the value. Even with
Markov's inequality this is not sufficient to prove almost sure convergence, but only convergence in expectation
(or in probability). We now define, for any $\theta > 0$, an alternative notion of a value of the game, $V_T^\theta(\ell, \Phi_T)$. It
guarantees the existence of a randomized online learning algorithm which in $T$ rounds achieves regret smaller
than $\theta$ with probability at least $1 - V_T^\theta(\ell, \Phi_T)$ over its randomization. Using this value we are able to prove
almost sure convergence for many games.

Definition 15. For any $\theta > 0$ define the value of the game as

$$V_T^\theta(\ell, \Phi_T) = \inf_{q_1} \sup_{x_1} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T \sim q_T} \mathbf{1}\left\{ \sup_{\phi \in \Phi_T} \left\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \right\} > \theta \right\} . \tag{19}$$

It is natural to think of the sequence of infima, suprema, and expectations as a stochastic process which
generates the $f_t$'s and $x_t$'s. The "in-expectation" version of the value of the game is the expected
performance measure $R_T$ under a draw from this stochastic process. The "in probability" Definition 15 is
the probability that the performance measure $R_T$ exceeds a threshold $\theta$.

The above value of the game is related to the expected version of the value of the game. To see this, note
that whenever $R_T$ is a non-negative random variable, by Markov's inequality we can conclude that

$$V_T^\theta(\ell, \Phi_T) \le \frac{V_T(\ell, \Phi_T)}{\theta}$$

for any $\theta > 0$. Similarly, if $B$ is bounded by $L$ then we can conclude that

$$V_T(\ell, \Phi_T) \le \inf_{\theta > 0} \left\{ \theta + 2L\, V_T^\theta(\ell, \Phi_T) \right\} .$$

Since it is possible to bound expectation by integrating tail probabilities, we will sometimes get better bounds
on the expected version of the value by integrating $V_T^\theta(\ell, \Phi_T)$ with respect to $\theta$.
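The two relations above rest on elementary inequalities for a single bounded, non-negative random variable. The sketch below checks them exactly for a small discrete distribution standing in for $R_T$; this is only an illustration, since the values of the game are minimax quantities rather than a single expectation and tail probability. The distribution, the bound $2L$ with $L = 1$, and the function name are assumptions made for the example.

```python
def mean_and_tail(values, probs, theta):
    """Return (E[R], P(R > theta)) for a discrete random variable R."""
    mean = sum(v * p for v, p in zip(values, probs))
    tail = sum(p for v, p in zip(values, probs) if v > theta)
    return mean, tail

if __name__ == "__main__":
    L = 1.0                               # B bounded by L, so R takes values in [0, 2L]
    values = [0.0, 0.1, 0.5, 2.0]
    probs = [0.5, 0.3, 0.15, 0.05]
    for theta in (0.05, 0.3, 1.0):
        mean, tail = mean_and_tail(values, probs, theta)
        # Markov: P(R > theta) <= E[R] / theta
        # Converse for bounded R: E[R] <= theta + 2L * P(R > theta)
        print(theta, tail <= mean / theta, mean <= theta + 2 * L * tail)
```

Both checks hold for every threshold, which mirrors how a tail bound and an expectation bound can be traded against each other.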

Note that bounding $V_T^\theta(\ell, \Phi_T)$ will guarantee, for a fixed $T$ and $\theta$, the existence of a player strategy whose
regret against any adversary will not exceed $\theta$ with high probability. Such a guarantee may already suffice
in many cases. However, sometimes we want to prove the existence of Hannan consistent player strategies:
player strategies for a game with infinitely many rounds $t = 1, 2, \ldots$ such that $R_T \to 0$ almost surely against
any adversary. We will not pursue a formal development of such infinite round games here. Instead, we
will show later (in Section 6.2) how the tools developed below allow us to prove the existence of Hannan
consistent strategies for the calibration game. Similar arguments can be used to show the existence of
Hannan consistent player strategies for other games provided some analogue of the so-called "doubling
trick" is available.

The rest of the section is devoted to tools for bounding the value of the game as defined in Definition 15.
First, we provide the probability version of the Triplex Inequality.

Theorem 27 (Analogue of Theorem 1). For any $\theta > 0$, we have a probabilistic version of the Triplex
Inequality:

$$
\begin{aligned}
V_T^\theta(\ell, \Phi_T) \le{}& \sup_{D} \mathbb{P}_D \left( B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - B(\ell(q_1, p_1), \ldots, \ell(q_T, p_T)) > \theta/3 \right) \\
&+ \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \mathbf{1}\left\{ \sup_{\phi \in \Phi_T} \left\{ B(\ell(q_1, p_1), \ldots, \ell(q_T, p_T)) - B(\ell_{\phi_1}(q_1, p_1), \ldots, \ell_{\phi_T}(q_T, p_T)) \right\} > \theta/3 \right\} \\
&+ \sup_{D} \mathbb{P}_D \left( \sup_{\phi \in \Phi_T} \left\{ B(\ell_{\phi_1}(q_1, p_1), \ldots, \ell_{\phi_T}(q_T, p_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \right\} > \theta/3 \right) ,
\end{aligned}
$$

where $D$ ranges over distributions over sequences $(x_1, f_1), \ldots, (x_T, f_T)$.

Note that $D$ can be thought of as a sequence of conditional distributions $\{(p_t, q_t)\}_{t=1}^T$, where $p_t : (\mathcal{F} \times \mathcal{X})^{t-1} \to \mathcal{P}$
and $q_t : (\mathcal{F} \times \mathcal{X})^{t-1} \to \mathcal{Q}$.

We remark that the second term in the bound of Theorem 27 is deterministically either one or zero for a
given $\theta$.

After the decomposition of Theorem 27 has been established, we turn to upper bounds on the three terms.
Recall that, roughly speaking, the first term is typically bounded via martingale convergence, the second
term is bounded by the choice of the best response to the strategy of the adversary, and the third term
is bounded by sequential complexity. For the third term, we again apply the sequential symmetrization
technique, but now in probability instead of expectation. This requires a bit more work. In particular, for
the probabilistic version of Theorem 2 we first need the following mild assumption. We require that there is
some $T_0 < \infty$ such that for all $T > T_0$, for any fixed $\phi \in \Phi_T$,

$$\sup_{D} \mathbb{P}_D \left[ B\left( \ell_{\phi_1}(q_1, p_1) - \ell_{\phi_1}(f'_1, x'_1), \ldots, \ell_{\phi_T}(q_T, p_T) - \ell_{\phi_T}(f'_T, x'_T) \right) > \theta/6 \;\middle|\; (f_1, x_1), \ldots, (f_T, x_T) \right] < 1/2 . \tag{20}$$

Here $(f'_1, x'_1), \ldots, (f'_T, x'_T)$ is a sequence tangent to the sequence $(f_1, x_1), \ldots, (f_T, x_T)$, drawn from the
distributions $(q_1, p_1), \ldots, (q_T, p_T)$. We remark that the assumption of Eq. (20) is mild and will always be satisfied
(for $T$ large enough) in the problems we consider. Indeed, the tangent sequence is independent, given the
original sequence, and so (20) is a statement about the behavior of $B$ for zero-mean independent random
variables.



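Theorem 28. Suppose $B$ is sub-additive. Fix $\theta > 0$ and suppose $T$ is large enough so that (20) is satisfied. Then the third term in the Triplex Inequality is bounded by

$$
4\sup_{\mathbf{x},\mathbf{f}} P_\epsilon\Big( \sup_{\phi\in\Phi_T} B\big(\epsilon_1\ell_{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell_{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\big) > \theta/12 \Big).
$$

If, on the other hand, $-B$ is subadditive, the third term in the Triplex Inequality is instead bounded by

$$
4\sup_{\mathbf{x},\mathbf{f}} P_\epsilon\Big( \sup_{\phi\in\Phi_T} -B\big(\epsilon_1\ell_{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell_{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\big) > \theta/12 \Big).
$$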



The following lemma is useful for bounding the first term of the Triplex Inequality in Theorem 27 when the function $B$ is smooth in each of its arguments.

Lemma 29. For any $\mathcal{H}$-valued martingale difference sequence $\{z_t\}_{t=1}^T$ such that $\|z_t\| \le \eta$, if $B : \mathcal{H}^T \mapsto \mathbb{R}_+$ is such that $B^q$ is $(\sigma,p)$-smooth in each of its arguments and if for all $t \in [T]$, $\|\nabla_t B^q(z_1,\ldots,z_{t-1},0,\ldots,0)\| \le R$, then

$$
P\big(B(z_1,\ldots,z_T) > \theta\big) \le \exp\left(-\frac{\big(\theta^q - \sigma T\eta^p/p\big)^2}{2\eta^2R^2T}\right).
$$

In particular, using Lemma 29 above we can upper bound the third term of the Triplex Inequality for finite sets of payoff transformations.

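Corollary 30. For any finite set of payoff transformations, under the conditions of Lemma 29,

$$
\sup_D P_D\Big( \sup_{\phi\in\Phi_T} B\big(\ell_{\phi_1}(q_1,p_1) - \ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(q_T,p_T) - \ell_{\phi_T}(f_T,x_T)\big) > \theta \Big) \le |\Phi_T| \exp\left(-\frac{\big(\theta^q - \sigma T\eta^p/p\big)^2}{2\eta^2R^2T}\right).
$$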



The above results hold under very general assumptions of smoothness of $B$. Stronger results are attainable if we make an additional assumption that $B$ is a function of the average of its coordinates. The next subsection is devoted to this assumption.

6.1 When B is a Function of the Average



Throughout this section, we assume that $B$ is a function of the average of its coordinates:

$$
B(z_1,\ldots,z_T) = G\Big(\frac{1}{T}\sum_{t=1}^T z_t\Big).
$$



The following upper bound can be derived. 

Lemma 31. Suppose $G \ge 0$ is sub-additive, 1-Lipschitz in the norm $\|\cdot\|$, and $G(0) = 0$. Then

$$
\sup_{\mathbf{f},\mathbf{x}} P_\epsilon\Big( \sup_{\phi\in\Phi_T} G\Big(\frac{1}{T}\sum_{t=1}^T \epsilon_t\ell_{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big) > \theta \Big) \le \mathcal{N}_1(\theta/2,\Phi_T,T)\, \sup_{\mathbf{z}} P_\epsilon\Big( G\Big(\frac{1}{T}\sum_{t=1}^T \epsilon_t\mathbf{z}_t(\epsilon)\Big) > \theta/2 \Big)
$$

where the supremum on the right hand side is over $\mathcal{H}$-valued trees.

Lemma 31 upper bounds the probabilistic version of sequential complexity by the size of an $\ell_1$ cover times the probability that the norm of a martingale difference sequence generated by random signs is close to zero. When the norm in question is 2-smooth, we can invoke results on concentration of martingales due to Pinelis [23]. The results have been re-proven for general 2-smooth functions in the Appendix.
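To make the kind of tail probability appearing on the right hand side of Lemma 31 concrete, the sketch below treats the simplest special case: a real-valued tree that is constantly equal to 1, so that the average is just a normalized sum of independent random signs. It computes the tail exactly and compares it with the classical Hoeffding bound $\exp(-T\theta^2/2)$. The function names are hypothetical, and this scalar example is only an illustration, not the Banach-space result of Pinelis.

```python
import math


def rademacher_tail(T, theta):
    """Exact P((1/T) * sum_t eps_t > theta) for i.i.d. uniform random signs eps_t."""
    favorable = 0
    for heads in range(T + 1):
        # With `heads` signs equal to +1, the sum of signs is 2 * heads - T.
        if 2 * heads - T > theta * T:
            favorable += math.comb(T, heads)
    return favorable / 2 ** T


def hoeffding_bound(T, theta):
    """Hoeffding's bound exp(-T * theta^2 / 2) for the same tail event."""
    return math.exp(-T * theta ** 2 / 2)


if __name__ == "__main__":
    for T in (10, 50, 200):
        for theta in (0.1, 0.3, 0.5):
            print(T, theta, rademacher_tail(T, theta), hoeffding_bound(T, theta))
```

The exact tail never exceeds the exponential bound, and both decay at the rate $\exp(-cT\theta^2)$ that reappears in the results of this section.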

Corollary 32. Under the assumptions of Lemma 31, if G 2 is (a,2)-smooth with respect to || • || and 
P</>(/> x )\\ ^ V f or a M §i /j x : then for any T > 9 /Ao~ , we have 

supP £ ^sup G^^e^ t (f t (e),x t ( e ))^ > 9^j < 2^(0/2, $ T , T) exp {"j^a} • 

When $B$ is a function of the average of its arguments, Lemma 31 and Corollary 32 allow us to control the third term in the Triplex Inequality by applying Theorem 28. Now, we would like to generalize the above results in two directions. First, we would like to obtain Dudley integral-type upper bounds instead of the $\ell_1$-cover at a fixed scale. Second, we wish to consider norms which are $p$-smooth for $1 < p \le 2$. Both extensions enlarge the scope of problems that can be addressed and also make the upper bounds sharp.

We start by considering the real-valued case with the goal of obtaining upper bounds using the chaining 
technique. 

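Proposition 33. Suppose $\mathcal{H} \subseteq [-1, 1]$. We have that for any $\theta > \sqrt{8/T}$,

$$
P_\epsilon\Big( \sup_{\phi\in\Phi_T} \frac{1}{T}\sum_{t=1}^T \epsilon_t\ell_{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) > \inf_{\alpha}\Big\{ 4\alpha + 12\theta\int_\alpha^1 \sqrt{\log\mathcal{N}_\infty(\delta,\Phi_T,T)}\,d\delta \Big\} \Big) \le L\exp\{-T\theta^2/2\}
$$

where $L$ is a constant such that $L > \sum_{j=1}^\infty \mathcal{N}_\infty(2^{-j},\Phi_T,T)^{-1}$. In particular, for time-invariant constant departure mappings,

$$
P_\epsilon\Big( \sup_{f\in\mathcal{F}} \frac{1}{T}\sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) > \inf_{\alpha}\Big\{ 4\alpha + 12\theta\int_\alpha^1 \sqrt{\log\mathcal{N}_\infty(\delta,\mathcal{F},T)}\,d\delta \Big\} \Big) \le L\exp\{-T\theta^2/2\}.
$$

Furthermore, we have

$$
P_\epsilon\Big( \sup_{f\in\mathcal{F}} \frac{1}{T}\sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(\epsilon)) > 128\,\mathfrak{R}_T(\mathcal{F})\Big(1 + \theta\sqrt{T\log^3(2T)}\Big) \Big) \le L\exp\{-T\theta^2/2\}
$$

where $\mathfrak{R}_T(\mathcal{F})$ is the sequential Rademacher complexity of $\mathcal{F}$ as defined in [25].

The next lemma generalizes Proposition 33 to 2-smooth norms. Its proof is almost identical to that of Proposition 33 and will be omitted.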

Lemma 34. Assume that $G \ge 0$ is 1-Lipschitz w.r.t. the norm $\|\cdot\|$, sub-additive, $G(0) = 0$, and $G^2$ is $(\sigma, 2)$-smooth. Further, suppose that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f, x)\| \le 1$. Then for any $\theta > \sqrt{8\sigma/T}$:



TO 2 



P E f sup G fi5Zet^ t (ft(e),x t (e))J > inf + ^ yflo&M„(6,* T ,T)Myj < Lexpj- 
where $L$ is a constant such that $L > 2\sum_{j=1}^\infty \mathcal{N}_\infty(2^{-j},\Phi_T,T)^{-1}$.

We now turn to the goal of proving upper bounds for general $p$-smooth norms. The following lemma is the main building block for Theorem 36. It provides a large deviation inequality for (Walsh-Paley) martingale difference sequences in a $(\sigma,p)$-smooth Banach space. As such, it may be of independent interest.

Lemma 35. Let $(\mathcal{B}, \|\cdot\|)$ be a $(\sigma,p)$-smooth space. Let $\mathbf{x}$ be any $\mathcal{B}$-valued tree of depth $T$ with $\|\mathbf{x}_t(\epsilon)\| \le R$ for any $t, \epsilon$. For any $\nu > 8\sigma^{1/p}\log^{3/2}T / T^{1-1/p}$, we have that



T 



4=1 



>128 ^^^ + 128 vR \ <2exp 



2<j 2 /p log 3 T 



With the above concentration inequality in hand, we can now derive a Dudley integral type bound when $\mathcal{H}$ is a subset of a $(\sigma,p)$-smooth space.

Theorem 36. Assume that $G \ge 0$ is 1-Lipschitz w.r.t. the norm $\|\cdot\|$ and that $(\mathcal{B}, \|\cdot\|)$ is a $(\sigma,p)$-smooth space. Further, suppose that for any $x \in \mathcal{X}$, $f \in \mathcal{F}$, $\phi \in \Phi_T$ and $t \in [T]$, it is true that $\|\ell_{\phi_t}(f,x)\| \le 1$. Then for

102 4er 1/p log 3/2 T . 



any > 



< Lexp 



2rp2-2/p 



65536 a 2 /? log 3 T 



where $L$ is a constant such that $L > 2\sum_{j=1}^\infty \mathcal{N}_\infty(2^{-j},\Phi_T,T)^{-1}$.



6.2 An Almost-Sure Bound for Calibration 

For the calibration game, using the tools developed above, we first show the existence of a player strategy 
guaranteeing small regret with arbitrarily high probability. 

Theorem 37. For the calibration game with $k$ outcomes and with the $\ell_1$ norm, we have that for any $\theta > 3/T$,

$$
\mathcal{V}^\theta_T \le 8\exp\left(-\frac{T(\theta/12)^2}{16k} + ck^3\log(T)\right) \tag{21}
$$

where $c$ is a fixed numerical constant. The inequality (21) above can be restated as: for any $\eta \in (0, 1)$, there is a player strategy such that, with probability at least $1 - \eta$,

$$
R_T \le 48\sqrt{\frac{k\log(8/\eta) + ck^4\log T}{T}}
$$

for $T \ge 3$.

Proof of Theorem 37. The proof is similar to that of Theorem 25, with the exception of controlling appropriate quantities in probability instead of in expectation. We consider the value of the game $\mathcal{V}^\theta_T(\ell, \Phi_T)$ as in Definition 15 for some $\theta > 0$. Let $\delta > 0$ be determined later. Let $\|\cdot\|$ denote the $\ell_1$ norm. Let $C_\delta$ be a maximal $2\delta$-packing of $\Delta(k)$ in this norm. Consider the calibration game defined in Example 4 augmented with the restriction that the player's choice belongs to $C_\delta$ instead of $\Delta(k)$. The corresponding minimax expression with this restriction is clearly an upper bound on the value of the game defined in Example 4.

We now use the probabilistic version of the Triplex Inequality (Theorem 27). Observe that the first term in the Triplex Inequality is zero. The second term is upper bounded by a particular (sub)optimal response $q_t$ being the point mass on $p_t^\delta$, the element of $C_\delta$ closest to $p_t$. Note that any maximal $2\delta$-packing is also a $2\delta$-cover. Thus, the second term becomes



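$$
\begin{aligned}
&\sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T} \mathbf{1}\Big\{ \sup_{\phi\in\Phi_T} \big\{ -B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T)) \big\} > \theta/3 \Big\} \\
&\le \sup_{p_1}\cdots\sup_{p_T} \mathbf{1}\Big\{ \sup_{\lambda>0}\sup_{p\in\Delta(k)} \Big\| \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{\|p_t^\delta - p\| \le \lambda\}\cdot(p_t^\delta - p_t) \Big\| > \theta/3 \Big\} \\
&\le \mathbf{1}\{\delta > \theta/3\}.
\end{aligned}
$$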



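We now proceed to upper bound the third term in the Triplex Inequality. If $T$ is large enough such that the conditions of Theorem 28 are satisfied, the third term in the Triplex Inequality is upper bounded by

$$
4\sup_{\mathbf{x},\mathbf{f}} P_\epsilon\Big( \sup_{\lambda>0}\sup_{p\in\Delta(k)} \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\| \le \lambda\}\cdot(\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \Big\| > \theta/12 \Big)
$$

since $-B$ is subadditive.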



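Note that $\mathbf{f}$ is a $C_\delta$-valued tree, not a $\Delta(k)$-valued tree. Using this fact, we would like to pass from the supremum over $\lambda > 0$ and $p \in \Delta(k)$ to a supremum over a finite discrete set.

To this end, fix $\mathbf{f}$, $\mathbf{x}$ and $\epsilon_{1:T}$ and let us see how many genuinely different functions we can get by varying $\lambda > 0$ and $p \in \Delta(k)$. This question boils down to looking at the size of the class

$$
\mathcal{G} := \big\{ g_{p,\lambda}(f) = \mathbf{1}\{\|f - p\| \le \lambda\} : p \in \Delta(k),\ \lambda > 0 \big\}
$$

over the possible values of $f \in C_\delta$. Indeed, if $g_{p,\lambda}(f) = g_{p',\lambda'}(f)$ for all $f \in C_\delta$, then

$$
\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\| \le \lambda\}\cdot(\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) = \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p'\| \le \lambda'\}\cdot(\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)).
$$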

We appeal to VC theory for bounding the size of $\mathcal{G}$ over $C_\delta$. First, we claim that the VC dimension of $\mathcal{G}$ is $O(k^2)$. Note that $\mathcal{G}$ is the class of indicators over $\ell_1$ balls of radius $\lambda$ centered at $p$ for various values of $p, \lambda$. A result of Goldberg and Jerrum [T?] states that for a class $\mathcal{G}$ of functions parametrized by a vector of length $d$, if for $g \in \mathcal{G}$ and $f \in \mathcal{F}$, $\mathbf{1}\{g(f) = 1\}$ can be computed using $m$ arithmetic operations, the VC dimension of $\mathcal{G}$ is $O(md)$. In our case, the functions in $\mathcal{G}$ are parametrized by $k$ values and membership $\|f - p\|_1 \le \lambda$ can be established in $O(k)$ operations. This yields an $O(k^2)$ bound on the VC dimension of $\mathcal{G}$. By the Sauer-Shelah Lemma, the number of different labelings of the set $C_\delta$ by $\mathcal{G}$ is bounded by $|C_\delta|^{c\cdot k^2}$ for some absolute constant $c$. We conclude that the effective number of different $(p, \lambda)$ is finite. Let us remark that the VC upper bound is not used in place of the sequential Littlestone's dimension. It is only used to show that the set $\Phi_T$ is finite, and such a technique can be useful when the set of player's actions is finite.
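The counting step above can be checked by brute force in the smallest non-trivial case. The sketch below takes $k = 2$, identifies a point $(a, 1-a)$ of $\Delta(2)$ with $a \in [0,1]$ (so the $\ell_1$ distance between two points is $2|a - b|$ and $\ell_1$ balls are intervals, a class of VC dimension 2), enumerates the labelings of a small finite point set induced by a grid of centers and radii, and compares the count with the Sauer-Shelah bound $\sum_{i \le d}\binom{n}{i}$. The point set, the grids, and the function names are illustrative assumptions; the example does not reproduce the $O(k^2)$ bound for general $k$.

```python
import math
from itertools import product


def l1_ball_labelings(points, centers, radii):
    """Distinct 0/1 labelings of `points` by l1 balls in the 2-outcome simplex.

    A simplex point (a, 1 - a) is represented by the scalar a, so the l1
    distance between the points represented by a and b is 2 * |a - b|.
    """
    labelings = set()
    for center, radius in product(centers, radii):
        labelings.add(tuple(int(2 * abs(a - center) <= radius) for a in points))
    return labelings


def sauer_shelah(n, d):
    """Sauer-Shelah bound on the number of labelings of n points by a class of VC dimension d."""
    return sum(math.comb(n, i) for i in range(d + 1))


if __name__ == "__main__":
    points = [0.1, 0.3, 0.5, 0.7, 0.9]            # stand-in for the finite set C_delta
    centers = [i / 100 for i in range(101)]       # grid over p
    radii = [j / 100 for j in range(1, 201)]      # grid over lambda > 0
    found = l1_ball_labelings(points, centers, radii)
    print(len(found), "labelings found; Sauer-Shelah bound:", sauer_shelah(len(points), 2))
```

Although there are infinitely many pairs $(p, \lambda)$, only polynomially many distinct indicator patterns arise on a finite set, which is exactly what allows the supremum to be replaced by a maximum over a finite set $S$.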

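Hence, there exists a finite set $S$ of pairs $(\lambda, p)$ with cardinality $|S| \le |C_\delta|^{c\cdot k^2}$ such that

$$
\begin{aligned}
&4\sup_{\mathbf{x},\mathbf{f}} P_\epsilon\Big( \sup_{\lambda>0}\sup_{p\in\Delta(k)} \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\| \le \lambda\}\cdot(\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \Big\| > \theta/12 \Big) \\
&\le 4\sup_{\mathbf{x},\mathbf{f}} P_\epsilon\Big( \max_{(p,\lambda)\in S} \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{1}\{\|\mathbf{f}_t(\epsilon) - p\| \le \lambda\}\cdot(\mathbf{f}_t(\epsilon) - \mathbf{x}_t(\epsilon)) \Big\| > \theta/12 \Big) \\
&\le 4|S| \sup_{\mathbf{z}} P_\epsilon\Big( \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{z}_t(\epsilon) \Big\| > \theta/12 \Big)
\end{aligned}
$$

where the supremum is over all $2B_1^k$-valued binary trees $\mathbf{z}$ of depth $T$, where $B_1^k$ is the unit $\ell_1$ ball in $\mathbb{R}^k$. Note that $\|\cdot\|_1 \le \sqrt{k}\,\|\cdot\|_2$. By Corollary

$$
P_\epsilon\Big( \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{z}_t(\epsilon) \Big\|_1 > \theta/12 \Big) \le P_\epsilon\Big( \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{z}_t(\epsilon) \Big\|_2 > \theta/(12\sqrt{k}) \Big) \le 2\exp\left(-\frac{T(\theta/12)^2}{16k}\right).
$$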



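Now note that the size of the set $C_\delta$, the $2\delta$-packing of $\Delta(k)$, is upper bounded by the size of the minimal $\delta$-cover of $\Delta(k)$, which is at most $(c'/\delta)^k$ for an absolute constant $c'$, and so, adjusting the numerical constant $c$, we see that

$$
4|S| \sup_{\mathbf{z}} P_\epsilon\Big( \Big\| \frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{z}_t(\epsilon) \Big\| > \theta/12 \Big) \le 8\,|C_\delta|^{c k^2} \exp\left(-\frac{T(\theta/12)^2}{16k}\right) \le 8\exp\left(-\frac{T(\theta/12)^2}{16k} + ck^3\log(1/\delta)\right).
$$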



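Combining everything we see that

$$
\mathcal{V}^\theta_T \le \mathbf{1}\{\delta > \theta/3\} + 8\exp\left(-\frac{T(\theta/12)^2}{16k} + ck^3\log(1/\delta)\right).
$$

Choosing $\delta = 1/T$ gives

$$
\mathcal{V}^\theta_T \le \mathbf{1}\{1/T > \theta/3\} + 8\exp\left(-\frac{T(\theta/12)^2}{16k} + ck^3\log(T)\right),
$$

which gives the first statement of the theorem.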

We now rewrite the result in terms of a fixed probability of deviation. To this end, set

$$
\frac{\eta}{8} = \exp\left(-\frac{T(\theta/12)^2}{16k} + ck^3\log(T)\right),
$$

which gives

$$
\theta = 48\sqrt{\frac{k\log(8/\eta) + ck^4\log T}{T}}.
$$

Note that for any $T \ge 3$ and $\eta \in (0, 1)$, we have

$$
T > \frac{1}{16^2\,\big(k\log(8/\eta) + ck^4\log T\big)},
$$

so that this choice of $\theta$ satisfies $\theta > 3/T$. Hence we conclude that for any $\eta \in (0, 1)$, we have with probability at least $1 - \eta$,

$$
R_T \le 48\sqrt{\frac{k\log(8/\eta) + ck^4\log T}{T}}.
$$

□
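The passage between the two statements of Theorem 37 is a matter of inverting the exponential bound, and the algebra can be checked numerically. In the sketch below, `value_bound` evaluates the right hand side of (21) and `theta_for_confidence` evaluates the deviation level $48\sqrt{(k\log(8/\eta) + ck^4\log T)/T}$; plugging the latter into the former should return $\eta$. The function names are hypothetical, and since the numerical constant $c$ is not specified in the text, the value `c = 1.0` is an arbitrary placeholder.

```python
import math


def value_bound(T, theta, k, c):
    """Right hand side of (21): 8 * exp(-T (theta/12)^2 / (16 k) + c k^3 log T)."""
    return 8 * math.exp(-T * (theta / 12) ** 2 / (16 * k) + c * k ** 3 * math.log(T))


def theta_for_confidence(T, eta, k, c):
    """Deviation level at which the bound (21) equals the failure probability eta."""
    return 48 * math.sqrt((k * math.log(8 / eta) + c * k ** 4 * math.log(T)) / T)


if __name__ == "__main__":
    c = 1.0  # placeholder: the paper only states that c is a fixed numerical constant
    for T, eta, k in [(1000, 0.05, 3), (10 ** 6, 0.01, 2)]:
        theta = theta_for_confidence(T, eta, k, c)
        print(T, eta, k, theta, value_bound(T, theta, k, c), theta > 3 / T)
```

The check also confirms that the resulting $\theta$ exceeds $3/T$ in these instances, which is the condition under which the indicator term in the proof vanishes.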



The above result almost suffices to get a result stating almost sure convergence. The only issue is that the player strategy guaranteed above depends on the confidence level $\eta$. In the proof of the following result, we show how to achieve small regret uniformly for all confidence levels $\eta$. Then, it is fairly easy to show the existence of a Hannan consistent strategy for the calibration game.

Theorem 38. Suppose the calibration game is played for infinitely many rounds T = 1, 2, . . .. Then there 
exists a player strategy such that against any adversary we have, 



JT 

lim sup — = • Rt < 60 almost surely 



T ^°° • '3k log(2T) + £§- log(T) 



The proof of Theorem 38 can be taken as a general recipe for proving almost sure bounds (and, therefore, 
Hannan consistency). The idea is to lift the dependence of the in-probability value Vf- (as well as player's 
strategy) on 9 by instead considering a closely related value of the form Eexp {i^R^} for some appropriate 
T-dependent factor K . Whenever this value is bounded, Markov's inequality gives tail bounds for a strategy 
that does not depend on $\theta$. Together with a doubling trick, this leads to an almost sure convergence guarantee.

Appendix

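Proof of Theorem 1. The value of the game, defined in (2), is

$$
\begin{aligned}
\mathcal{V}_T(\ell,\Phi_T) &= \inf_{q_1}\sup_{p_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\inf_{q_T}\sup_{p_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\ \sup_{\phi\in\Phi_T}\big\{ B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T)) \big\} \\
&= \sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\ \sup_{\phi\in\Phi_T}\big\{ B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T)) \big\}
\end{aligned}
$$

via an application of the minimax theorem. Adding and subtracting terms to the expression above leads to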



Vt(£, $t) = supinf E ...supinf E 

P! 91 /l~9l p T It /t~9t 



B(£(f 1 ,x 1 ),...J(f T ,x T )) 



E B(£(f[,x[), . . . ,£(f' Tl x' T )) 

/ 1:T ~9l:T 



sup < 



E B(i(f 1 ,x , 1 ),...,£(f^x' T ))-B(£^(f u x 1 ),...,£^(fT,x T )) 

/l :T ~<?l:T 



< sup inf E ... sup inf E 

P! 9i /i~9i p T qT fT~qr 



B(£(fi,xi), . . . ,£(f T ,x T )) - , E B(£(f[, Xi), . . .,£{f T , x' T )) 

/ 1:T ~9l:T 



c l iT ~Pl:T 



+ sup e {BW/;, I ' 1 ),...,i(/;,4))-B(^(/;,4),..,^(/;, I y)} 



= l :T ~Pl:T 



+ sup { E B(^ 1 (/{, 2 :' 1 ),...,^ T (^,^))-S(^ 1 (/ 1 ,a: 1 ),...,^ T (/ T ^ T )) 

</>G<t>T /l : T~9l:T 



"l:T~ P l:T 



At this point, we would like to break up the expression into three terms. To do so, notice that expectation is linear and sup is a convex function, while for the infimum,

$$
\inf_a\big[C_1(a) + C_2(a) + C_3(a)\big] \le \Big[\sup_a C_1(a)\Big] + \Big[\inf_a C_2(a)\Big] + \Big[\sup_a C_3(a)\Big]
$$

for functions $C_1, C_2, C_3$. We use these properties of inf, sup, and expectation, starting from the inside of the nested expression and splitting the expression in three parts. We arrive at

< sup sup E . . . sup sup E \B{i{h,x 1 ),...,£{f T ,x T ))- E B(£(f[,x[), . . . ,£(f^,x' T ))] 

PI 91 /l~91 PT IT JT~QT L /i.-r~9l:T J 

* 1:T ~!>1:T 



+ sup inf E ... sup inf E 

P1 <31 /l~81 p T IT /t~9t 



sup sup E ... sup sup E 

Pi 11 pt It St~1t 



sup E {B(f(/;, j!), ■ • ■ , «(/t, ^t)) - (/!, i'i), • ■ • , ^ T (/r, %))} 

4>£&T fl-.T^Ql-T 



" UT ~P1:T 



sup < 



, E B(l4, 1 (f[,x'i),...,£ 4 , T (f T ,x T ))-B(£ < i >1 (f 1 ,xi),...,£ < i >T (fT,XT)) 

f\:T~ q l-T 



The replacement of infima by suprema in the first and third terms appears to be a loose step and, indeed, one can pick a particular response strategy $\{q_t\}$ instead of passing to the supremum. For instance, this can be the best-response strategy for the second term. However, in the examples we have considered so far, passing to the supremum still yields the results we need. This is due to the fact that the online learning setting is worst-case.

Consider the second term in the above decomposition. We claim that 



sup inf E ... sup inf E 

pi 91 p T It fr^QT 



sup E [B(£(f[,x[),...,e(f^xi r ))-B(i^(f[ > x[),...J^MT^T)^ 

<f>£$ T /l:T~'? 1:T 
X 'l:T~ P l:T 



= sup inf ... sup inf sup E [B(£(fx, x x ), . . . ,£(fr, x T )) - B{i^(fx, xx), . . . , ^ t (/t, %t))] 

Pl 11 p T IT cj>£$ T fl:T~qi:T 

I 1:T~P1:T 

because the objective 

E [B(£(fi,x[),...J(f^,x , T ))-B(^ 1 (f[,x[),...J^MT^T))} 

J 1:T ~<?1:T 
*' 1:T ~P1:T 

does not depend on the random draws $f_1, x_1, \ldots, f_T, x_T$. We then rename $f'_t, x'_t$ into $f_t, x_t$. This concludes the proof of the Triplex Inequality. □

Proof of Theorem 2. We turn to the third term in the Triplex Inequality. If $B$ is subadditive,

e fl(^(/!,i;) ^(fT^TD-^t/^il.-Afc^)) 

/ 1:T ~<?1:T 
x' 1 . T ~Pl:T 

< E B(£ < j >1 (f[,x' 1 ) — £<f, 1 (fx,Xx), . . . ,£<t> T (fTi x T) ~ ^4>tUtiXt))- 

f l:T ~qi:T 

If, on the other hand, — B is subadditive, 

E B{£^ (f[,x[), ...,U T (f T , x' T )) - B(£ c f >1 (fx, Xl ), . . . ,£$ T (/t, x t )) 

/ 1:T ~<3l:T 

<- E B(^(/i,ii)-^^4... 1 ^(/T,iT)-^(/i. 1 ^)). (22) 

/ 1:T ~9l:T 
*>' 1 . T ~Pl;T 

Below we assume that $B$ is subadditive; the proof of the other case is identical.

To prove the bound on the third term in terms of twice sequential complexity, we proceed as in [25], applying the symmetrization technique from inside out. To this end, first note that



sup E ... sup E sup E B(£ r j, 1 (f[,x' 1 )-i 4>1 (f 1 ,x 1 ),...,£ rj>T (f! r ,x' T )-£ r f >T (fT:XT) 

pi,qifi~H Pt,1t /t~9t <be$ T ,/t~9t v 

x!~pi = T ~ PT 4~ P1 ,...4~ PT 

< Sup E ...Sup E SUp BU^ 1 (f[,x[)-£^ 1 (fx,Xx),...,£ t ) lT (fT,XT)-£4> T (fT,XT) 
Pl,qi Pt,9t Jt J t ~9t 0g$ T \ 



The above is true because the expectations are pulled outside the suprema, thus resulting in an upper bound. Now notice that, conditioned on history, $f_T, f'_T$ are distributed identically and independently, drawn from $q_T$. Similarly, $x_T, x'_T$ are also identically distributed conditioned on history. Hence, renaming them, we see that
E sup B{t 4>1 (f[,x' 1 ) -ifaifi^! ),..., i^ T (fT,x' T ) -UtUt^t)) 

]t,S t ~<It rf,£<S> T V / 

= e sup B(eM,x[)-£ < p 1 (f 1 ,x 1 ),...,e 4>T (fT,XT)-e <t , T (fT,x' T )) 

St >/t~?t </>g$ t v ' 

= E sup Bk(/i,a:i) - ^(/^i), . . . ,-(^ T (/^;) - ^(/t,x t ))) 



where only the last argument of B is changing sign. Thus, 

E sup BU^{f[,x\)-t^{f 1 ,x 1 ),...J^ T {f! T ,x' 1 )-l^ T {f T ,x T )) 



= E £T E sup B(^ 1 (/(,a;i)-^ 1 (/i,xi),...,e T (^ r (/^,a;^)-^ T (/ T ,a; T ))) 



where £t is a Rademacher random variable. Furthermore, 

Sup E SUP Bf^y;,!!)-^!/!,!!),....^^.^)-^^^^ 

PT,<lT fT,f T ~qT </>6$ T ^ ' 

= sup ( E E £T sup ^(^(/{^iJ-^t/i,!!),...,^^^,^)-^^,^))) 



< 



sup E £T sup B(^ 1 (/{,o;i)-^ 1 (/i,a;i),...,e T (^ T (/r,a; / T )-^ T (/ T ,a; T ))) 



it 



Proceeding similarly notice that since given history Xt-i,x' t _ 1 and /t-i, /t-i are distributed independently 
and identically we have, 



sup E sup E Er sup 

PT-1.9T-1 /r-l./r-l~9T-l i Tl i^,g^ 06*T 



sup E E, ,. , sup E,.,, sup 

PT-1.9T-1 /t-1./t_i~9T-1 i t ,i^,6A; 06* t 



B (t<l>Afu x 'i) - ^(/ljXi), . . . ,e T -i(^ T (/-r-i,a;T-i) - ^ T _ x (/t-i, xt-i)), £t(^ t (/t, Xt) - ^, t (/t, xt))) 
< sup E CT _ 1 sup E £T sup 

B^UAfi'Xl) - UA.h,Xl),.. . ,eT-l(U T _AfT-l,x' T -l) ~ £<PT-AfT-l,XT-l)),£T(tj> T (fT,XT) - Ut (fr, xr)fj 

Proceeding in similar fashion introducing Rademacher random variables all the way to t\ we arrive at 
sup E ...sup E sup Bf^^/Jjsi) - ^(/i.ii), . . • ,^ T (/T,a/ r ) - ^ T (/T,sr)) 

< sup E ei ... sup E £T sup Bfei^^/i,^) - UAh> x i))i ■ ■ ■ , ^(^tUt, x 't) ~ ^> t (/t, zt))) 
xi,x[ex x T ,x' T ex 4>e$>T 
Subadditivity of $B$ implies $B(a - b) \le B(a) + B(-b)$, and thus

B^M,^) - UAft,xi)), ■ ■ ■ , eAUAfr, xr) ~ UAfr, x T )j) 
< B(ei^ (/{, xi), • ■ ■ , erU T {&, x't)) + ^( - (/i, • • • , -er^ (/r, a*)) 
We, therefore, arrive at 

sup E £l ... sup E eT sup BUiii^i^^^) - £^(fi,xi)), . . . ,e T (i^ T {fT,x' T ) - e^ifTyXx))) 
xi,x[ex x T ,x' T ex 4>e<&T v ' 

<2 sup E ei ... sup E er s,up B[eil < j }1 (f 1 ,xi),...,e T £^ T (f T ,x T ) 
SieT.xxex f T er,x T ex </>e$ T v 

= 2 sup E ei . T sup B(ei^ 1 (fi(e),xi(e)), . . . ,e T ^ T (f T (e),x T (e))) 
(f,x) 0e* T v ' 

where in the last step we passed to the supremum over $(\mathcal{F}\times\mathcal{X})$-valued trees. This concludes the proof for the case of $B$ being subadditive. Starting from Eq. (22), the proof for the case of $-B$ being subadditive and convex in each of its coordinates leads to the bound of

2 sup E ei . T sup -B(ei^ 1 (fi(e),xi(e)), . . . ,e T ^ T (f T (e),x T (e))j. 
(f,x) 4>e$ r v ' 

The complete proof can be repeated for the first term in the Triplex Inequality in order to bound it by 
2?t T (£,Z, B) (or respectively 2<H T (£,Z, —£?)). □ 

The following Proposition is immediate from the definition of a smooth function via successive expansions 
of each coordinate around zero. 

Proposition 39. Assume the function $B : \mathcal{H}^T \mapsto \mathbb{R}$ is $(\sigma, p)$-uniformly smooth in each of its arguments and that $B(0,0,\ldots,0) = 0$. Then

$$
B(z_1,\ldots,z_T) \le \sum_{t=1}^T \big\langle \nabla_t B(z_1,\ldots,z_{t-1},0,\ldots,0),\, z_t \big\rangle + \frac{\sigma}{p}\sum_{t=1}^T \|z_t\|^p.
$$

Lemma 40. Assume that for some $q \ge 1$, $B^q$ is $(\sigma, p)$-uniformly smooth in each of its arguments and $B(0,\ldots,0) = 0$. Then we have that

$$
\sup_{p_1}\mathop{\mathbb{E}}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathop{\mathbb{E}}_{z_T,z'_T\sim p_T} B(z_1 - z'_1,\ldots,z_T - z'_T) \le \big((2\eta)^p\sigma T/p\big)^{1/q}
$$

where the maximization is over distributions $p_t$ with support in the ball $\eta\cdot B_{\|\cdot\|}$ of radius $\eta$.

Proof of Lemma 40. By Proposition 39 we have that



sup E ...sup E B q (z 1 — z[, . . . , zt — z' T ) 

pi zi,zJ~Pl p T z t ,z' t ~Pt 

{T T 1 

(v t s«(zx - z [, . . . , Zt ^ - 4^,0, . . . ,o),zt - 4) + - J2 II* - 4V \ 
t=i p t=i J 

<sup E ...sup E \Y,(VtB q (zi-z[,...,z t - 1 -4_ 1 ,0,...,0),z t -z' t )\ 

pi z u z[~ pi pT z T ,z' T ~p T I ~ x I 

+ sup E ...sup E |-V||^-^irl 

pi z 1 ,z' 1 ~p 1 pT z T ,z' T ~p T I p ^— ^ I 

E \-j^\\z t -z' t \A<(2r 1 YaTlp 



sup E ... sup 

pi zi.z'^pi p T Z T 



Since q > 1, by Jensen's inequality we conclude that 



sup E ...sup E -zi,...,z T -4) < {{2r]) p aT/p) 



i/'l 



Proof of Lemma [#| The proof follows immediately from Lemma 40 
Proof of Lemma^ By Proposition [39| we have: 



□ 
□ 



B q (ei^ (^(e), Xl (e)), . . . , e T ^ T (f r (e), x T (e))" 

T T 

< <V t B 9 (e 1 ^ 1 (f 1 (e),x 1 (e)), . . . , (f^e), x^e)), 0, . . . ,0), e t ^ t (f t (e),x t (e))> + - ^ ||^ t (/ t , x t )|| ! 



t=i 



< Y <*9t {Ui (fi (e) , xr (e) ),..., (f t (e) , x t (e))) + <nfT/p 
t=i 



where in the last line we used the definition of g t as well as an upper bound on the norm. Now by Jensen's 
inequality we get 

sup E ei . T sup B(e 1 ^ 1 (f 1 (e),Xi(e)),...,e T ^ T (f T (e),x T (e))J 

/ t \ 1 li 

< sup E ei . T sup Ve t5t (^ 1 (f 1 (e),x 1 (e)),...,^ t (f t (e),x t (e))) +arfT/p) 

/ T \ V<2 

< sup E 61:T sup Ve t5t (^ 1 (f 1 (e),x 1 (e)),...,^ t (f t (e),x t ( £ ))) + {arf /pf^T 1 ^ 



□

Proof of Proposition. Fix an $(\mathcal{F}\times\mathcal{X})$-valued tree $(\mathbf{f}, \mathbf{x})$. Note that

| 5 i(^ 1 (f 1 (e),x 1 (e)),...,^ t (f t (e),x t (e)))| 

< ||V t B 9 (e 1 £ 01 (f 1 ( e ),x 1 (e)),...,e t _ 1 ^ t _ 1 (f t _ 1 ( e ),x t „ 1 ( e )),O,...,O)||j|^ t (ft(e),x t (e))|| 

< R- v 

Using Lemma [5] 

T 

E ei-.T max V e t 9t(Ui ( f i( e ), x i( e )), • ■ ■ , Ut ( f t( e ), x *( e ))) 
t=i 



< 



, 21og(|$ T |) max max V fft (^ 1 (f 1 (e),x 1 (e)), . . . ,^ t (f t (e),x t (e)))^ 
\ <pe*Tee{±i} J ^— j 

< VV^" 2 log(|$ T |)T 

Now using Lemma [4] we obtain the desired result. 



□ 



Proof of Corollary. To appeal to Proposition 6 we need to specify smoothness parameters. It can be verified that if $G^q$ is $(\gamma,p)$-smooth in its argument, then $B^q$ is $(\gamma/T^p,p)$-smooth. Furthermore,

\\VtB"( Zl ,...,z T )\\, <p/T. 

The bound of Proposition [6] then becomes 



$K T (£,$ T ) < 



□ 



Proof of Lemma [#| The lemma follows directly from Theorem 46 To see this, just recall the definition 
of <Kt(^, $t): 



3tr(4# r ) = 8upE e sup G f i V e t ^ t (f 4 (e), x t (e)) ] 
f,x 0e* T \ t=L / 



For any fixed pair (f , x) of trees, the argument of G above is the sum of martingale difference sequences 
coming from a finite family. The step size bound B = rj/T and smoothness constant (7 = 7. □ 

Proof of Lemma [PI For any T and X- valued trees (f , x) , 



(23) 



where the supremum is over $\mathcal{H}$-valued trees such that $\|\mathbf{z}_t(\epsilon)\| \le \eta$. Further, by Lemma 35, for any

v > 8ct77 1/p log 3/2 T/T 1 - 1 /? } 

we have that, 



- ^e t z t (e) 



> rr + H < 2 exp 



,2^2-2/p 



2c 2 7 2/ P7? 2 log 3 y 
Plugging this into (23), we get,



v < 2|<I> T | exp 



v 2 T 2~2/p 



2c 2 7 2/ P?7 2 log 3 j, 



By a standard argument (e.g. Lemma 47) to integrate out the tail, we get

/ 1 T \ i/p 

\ s e u P GU^e t ^ t (f t (e),x t ( e ))j <|^(l + 21og 3 / 2 r(Vlog(2|a> T |) 

Making trivial over-approximations when T > 3 > e and > 1 gives the result.



□ 



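Proof of Theorem 10. Define $\beta_0 = 1$ and $\beta_j = 2^{-j}$. For a fixed tree $(\mathbf{f},\mathbf{x})$ of depth $T$, let $V_j$ be an $\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon \in \{\pm 1\}^T$ and any $\phi \in \Phi_T$, let $\mathbf{v}[\phi,\epsilon]^j \in V_j$ be a $\beta_j$-close element of the cover in the $\ell_\infty$ sense. Now, for any $\phi \in \Phi_T$,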



G ( ^E^ t (f t ( e ),x t ( e ))) < G (I^ et (^(f t ( e ),x t (e)) - v[0,e]f )) + E G ( ( v ^ e l 

V 4=1 /\ t=l / j = l \ 4=1 

T N / T 

< - £e t (^ t (f t ( e ),x t (e))-v[<Mf) + £ G - £ e t (vfe, eft - vfo e 

t=l 3=1 \ 4=1 

< max ||^ (f t (e), x 4 (e)) - v[0, e]f || +J2 G [fJ2 e]{ - v[0, e^ 1 )) 



j'-i 



Thus, 



sup G[if>^ t (f t (e),x t ( e ))] sup J £ G ( ± £ e t (v[0, e]> - v[<f>, e]?'" 1 ) J 

We now proceed to upper bound the second term. Consider all possible pairs of v s € Vj and v r e Vy_i, 
for l<s<|Vj|, l<r< |Vj_i|, where we assumed an arbitrary enumeration of elements. For each pair 
(v s ,v r ), define a real-valued tree w( s ' r ) by 



( s i r ) i \ 
; (e) 



v®(e) — v[(e) if there exists <fr <E <&t s.t. v s = v[0, e] J ',v r = v[(/>, ep 1 
otherwise. 



for all t S [T] and e € {±1} T . It is crucial that w( s,r ) can be non-zero only on those paths e for which v s and 
v r are indeed the members of the covers (at successive resolutions) close in the sense to some <fi € 
It is easy to see that well-defined. Let the set of trees Wj be defined as 



{w^:l< a <|^|,l<r<|^-i|} 



Using the above notations we see that 



sup G[ i^ et ^ t (f t (e),x t (e))j 



</3jv+E e 



£ sup G(if>w|(e)] 

j=1 vwew/ 3 V 4=i / 



(24)

From the way the trees in $W_j$ are constructed, it is easy to see that $\max_{t\in[T]}\|\mathbf{w}^j_t(\epsilon)\| \le 3\beta_j$ for any $\mathbf{w}^j \in W_j$ and any path $\epsilon$. Using Theorem 46 we get



E, 



su P G( i^e t ^ t (f t (e),x t ( e )) 



N 



T 

N 



< 



7log(2|^|-|VS--il) 
T 



.7 = 1 



3= 



Using standard arguments to move from the discretized sum to an integral, this gives the bound

2V7 f 1 



inf 4a + VI / ^logA^^, $ T ,T)d/3 . 



□ 



Proof of Corollary 11. The first statement is trivially verified. In fact, for this to hold we only require that $B$ is subadditive, affine in its arguments, and $B(0,\ldots,0) = 0$. Indeed, the expectations can be sequentially moved inside of $B$, making the coordinates of $B$ zero, and making the suprema over the distributions irrelevant.

For the second claim, consider the second term in Q, specialized to the case of departure mappings: 

sup inf... sup inf sup E \ ^ VV(/ 4) x t ) - &{<t>t{ft), %t) \ (25) 

pi 91 p T IT </, S $ r fl:T~qi:T 1 f~ * 

Pick a particular (sub)optimal response $q_t$ which puts all mass on $f_t^* = \arg\min_{f\in\mathcal{F}} \mathbb{E}_{x\sim p_t}\ell(f, x)$. It follows that, in expectation over $x_t \sim p_t$, $\ell(f_t,x_t) - \ell(\phi_t(f_t),x_t) \le 0$, ensuring that the quantity in (25) is non-positive.

The third claim is a straightforward consequence of Theorem 10. Indeed, $\mathcal{H} \subseteq [-1, 1]$ and $G(x) = |x|$, which is non-negative, zero at 0, Lipschitz, and $G^2$ is $(2, 2)$-smooth. □

Proof of Lemma \15\ Fix an (F x A")-valued tree (f, x) of depth T. Let («o, • • • , ik) be the sequence 
which defines intervals of time-invariant mappings for the sequence (<f>i, . . . , 4>t)- Fix e S {if} 7 "- Let 
v l ° , . . . , v 4fc e V be the elements of the cover closest to cf>i , ... , <fii k , respectively, on the path e. That is, 
for any a € {i , . • . ,ik}, 

m ax||^(f i (e),x t ( e ))-v t a (e)|| < a. 
By our assumption, on any interval J, defined by the endpoints a — ij and b = 

max ||^ a (f t (e),x 4 (e)) -^ 0t (f t (e),x t (e))|| < a, 

t£{es,...,o— 1} 

Hence, 

max ||^(f t (e),x t (e))-v?(e)||<2a 
te{a,...,&-i} 

Denoting by $a(t) \in \{i_0,\ldots,i_k\}$ the left endpoint of the interval to which $t$ belongs,

$$
\max_{t\in[T]} \big\|\ell_{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^{a(t)}_t(\epsilon)\big\| \le 2\alpha.
$$

It is then clear that to construct a $2\alpha$-cover for $\Phi_T^{k,\alpha}$ in the $\ell_\infty$ norm, it is enough to concatenate trees in $V$. More precisely, this is done as follows. Construct a set $V^k$ of $\mathcal{H}$-valued trees as
V k = {v' = v' (v°, . . . , v fc , i , . . . , i k ) : 1 = t < ii < . . . < i k < T, v°, . . . , v fe e V} 

and v' = v' (v°, . . . , v fe , i , . . . , is defined as a sequence of T mappings 

v ;(e)=v t o(t) (e) t€l a(t) 

for any e G {±1} T . Here 7 a = {ij, . . . , ~ 1} and a(t) is the index of the interval to which t belongs. In 
plain words, we consider all ways of partitioning {1, . . . ,T} into k + 1 intervals and defining a new set of 
trees out of V in such a way that within the interval, the values are given by a fixed tree from V. As before, 
it is clear that 



JV 00 (2a,$*' Q ,T) = \V k \ < Q ■Af 00 (a,$,T) k+1 , 



providing a control on the complexity of □ 
Lemma 41. Let $\mathcal{F}$ be the probability simplex in any dimension. Let $\|\cdot\|$ be any norm. The function

$$
x \mapsto \inf_{f\in\mathcal{F}} \|f \odot x\|,
$$

defined on the positive orthant, is concave.

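Proof. Since the function above is absolutely homogeneous and continuous, all we need to prove is

$$
\inf_{f}\|f \odot (x + y)\| \ge \inf_{f}\|f \odot x\| + \inf_{f}\|f \odot y\|
$$

for arbitrary $x, y$. That is, for arbitrary $f, x, y$,

$$
\|f \odot (x + y)\| \ge \inf_{f'}\|f' \odot x\| + \inf_{f'}\|f' \odot y\|.
$$

Define $g, h \in \mathcal{F}$ as follows:

$$
g_i = \frac{f_i(1 + y_i/x_i)}{Z_g}, \qquad h_i = \frac{f_i(1 + x_i/y_i)}{Z_h},
$$

where

$$
Z_g = \sum_i f_i(1 + y_i/x_i), \qquad Z_h = \sum_i f_i(1 + x_i/y_i).
$$

Now, as we show below, $1/Z_g + 1/Z_h \le 1$. Thus,

$$
\begin{aligned}
\|f \odot (x + y)\| &\ge \frac{1}{Z_g}\|f \odot (x + y)\| + \frac{1}{Z_h}\|f \odot (x + y)\| \\
&= \|g \odot x\| + \|h \odot y\| \\
&\ge \inf_{f'}\|f' \odot x\| + \inf_{f'}\|f' \odot y\|.
\end{aligned}
$$

To finish the proof, note that, by Cauchy-Schwarz,

$$
\Big(\sum_i f_i\,\frac{x_i + y_i}{x_i}\Big)\Big(\sum_i f_i\,\frac{x_i}{x_i + y_i}\Big) \ge \Big(\sum_i f_i\Big)^2 = 1.
$$

This shows

$$
\frac{1}{Z_g} \le \sum_i f_i\,\frac{x_i}{x_i + y_i}.
$$

Similarly, we get

$$
\frac{1}{Z_h} \le \sum_i f_i\,\frac{y_i}{x_i + y_i}.
$$

Adding them, we get

$$
\frac{1}{Z_g} + \frac{1}{Z_h} \le \sum_i f_i = 1
$$

as claimed. This completes the proof. □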



Proof of Proposition 1 6 Consider any equalizer strategy {p$ } for the adversary. Note that 

V t (£,$t) =inf sup E ...infsup E sup {B{i{f x , Xl ), . . . ,*(/ r , st)) - B^Cfi^i), . . • ,l$ T (f T ,xr))} 



>inf E inf E . . . inf E \B (£(f u Xl ), . . . , £(f T , x T )) - inf B(i^(f lt xi) 

qi Xx~pl q 2 x 2 ~P2 9T x T ~p* T 0G* T 



,^0t(/t,^t)) 



= E ... E < B (£(/, xi), . . . , £(/, xt)) — inf B {i^ (f,x\), . . .,1<^ T (/, xt)) 

xi~pi x t ~Pt y 0G$r 

where / e J 7 is any arbitrary choice fixed before starting the game and p t = p* ({f s = /, x s }^ =1 ) is defined 
by the equalizer strategy. □ 

Lemma 42. For any departure mapping <&t and any L > we have that 
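Proof. Note that for any convex $x_1, \ldots, x_T$ we have that

$$
\begin{aligned}
\sum_{t=1}^T x_t(f_t) - \inf_{\phi\in\Phi}\sum_{t=1}^T x_t(\phi\circ f_t) &= \sup_{\phi\in\Phi}\sum_{t=1}^T \big(x_t(f_t) - x_t(\phi\circ f_t)\big) \\
&\le \sup_{\phi\in\Phi}\sum_{t=1}^T \big\langle \nabla x_t(f_t),\, f_t - \phi\circ f_t \big\rangle \\
&= \sum_{t=1}^T \big\langle \nabla x_t(f_t), f_t \big\rangle - \inf_{\phi\in\Phi}\sum_{t=1}^T \big\langle \nabla x_t(f_t), \phi\circ f_t \big\rangle. \qquad (26)
\end{aligned}
$$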



For any adversary strategy $x^* = (x^*_1, \ldots, x^*_T)$ where each $x^*_t : \mathcal{F}^{t} \mapsto \mathcal{X}$ and any player strategy $f^* = (f^*_1, \ldots, f^*_T)$ where each $f^*_t : \mathcal{X}^{t-1} \mapsto \mathcal{F}$, by Equation (26) we have that

$$
\sum_{t=1}^T \big\langle \nabla x_t(f_t), f_t \big\rangle - \inf_{\phi\in\Phi}\sum_{t=1}^T \big\langle \nabla x_t(f_t), \phi\circ f_t \big\rangle \ge \sum_{t=1}^T x_t(f_t) - \inf_{\phi\in\Phi}\sum_{t=1}^T x_t(\phi\circ f_t)
$$

where in the above, $f_t = f^*_t\big(\langle\nabla x_1(f_1), \cdot\rangle, \ldots, \langle\nabla x_{t-1}(f_{t-1}), \cdot\rangle\big)$ and $x_t = x^*_t(f_1, \ldots, f_t)$. Now if we take $f^*$ and $x^*$ to be the minimax optimal strategies then we see that

T T 

V T (Lr, T, $ T ) > V (Vx t (f t ),f t ) - inf V (Vx t (f t ), o / t ) 

T T 

>£>(/,)- inf T^o/t) 



Thus we see that the value of the linear game upper bounds the value of the Lipschitz convex game. In
fact, the above argument shows that any strategy that provides a vanishing regret guarantee against a linear
adversary provides a vanishing regret guarantee (with the same rate) against a convex Lipschitz adversary. This
means that all one needs in order to solve online convex Lipschitz optimization optimally is to be able to solve
online linear optimization optimally and to be able to compute a subgradient of a given function at any
desired point.

Further since the set of linear functions is a subset of the set of convex Lipschitz functions we can conclude 
that 

Hence we conclude the required statement that the value of the linear game is equal to the value of the 
convex Lipschitz game. □ 
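The linearization step behind Equation (26) can be illustrated on a toy one-dimensional instance. This is a minimal sketch under assumptions that are not in the source: the player's set is taken to be an interval of the real line, the adversary's functions are the quadratics $x_t(f) = (f - a_t)^2$, and all names are hypothetical.

```python
def phi_regret(plays, losses, phi):
    """Regret of the plays f_t against the modified plays phi(f_t) under losses x_t."""
    return sum(x(f) - x(phi(f)) for f, x in zip(plays, losses))


def linearized_phi_regret(plays, grads, phi):
    """Same regret with each x_t replaced by its linearization at f_t."""
    return sum(g(f) * (f - phi(f)) for f, g in zip(plays, grads))


# Assumed toy data: quadratic losses x_t(f) = (f - a_t)^2 and their gradients.
targets = [0.3, -0.7, 0.9, 0.1]
plays = [0.0, 0.5, -0.2, 0.4]
losses = [lambda f, a=a: (f - a) ** 2 for a in targets]
grads = [lambda f, a=a: 2 * (f - a) for a in targets]

# Three departure mappings: a constant map (external regret), a reflection, the identity.
mappings = (lambda f: 0.25, lambda f: -f, lambda f: f)
for phi in mappings:
    convex_value = phi_regret(plays, losses, phi)
    linear_value = linearized_phi_regret(plays, grads, phi)
    assert convex_value <= linear_value + 1e-12
```

For these quadratics the gap between the two quantities is exactly $\sum_t (f_t - \phi(f_t))^2$, so the linearized regret always dominates, in line with the convexity inequality used in the proof.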

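Lemma 43. Consider a game where the player plays from a set $\mathcal{F}$, the adversary from a set $\mathcal{X}$, and we are given a linear
$B$, loss $\ell$ and transformation set $\Phi_T$. Assume that there exist a set $\mathcal{X}'$, a loss function $\ell'$ and a transformation
set $\Phi'_T$ such that for any $\phi\in\Phi_T$ there exists $\phi'\in\Phi'_T$ such that for $x\in\mathcal{X}$ and $f\in\mathcal{F}$ there exists an $x'\in\mathcal{X}'$
such that for any $t\in[T]$,

\[ \ell(f,x) - \ell_{\phi_t}(f,x) \le \ell'(f,x') - \ell'_{\phi'_t}(f,x') . \]

In that case we can conclude that the value of the first game is bounded by the value of the second game played with
$\mathcal{F}, \mathcal{X}', B, \ell', \Phi'_T$, that is

\[ \mathcal{V}_T(\ell,\Phi_T,\mathcal{F},\mathcal{X}) \le \mathcal{V}_T(\ell',\Phi'_T,\mathcal{F},\mathcal{X}') . \]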



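Proof. By assumption, for any $\phi\in\Phi_T$ there exists $\phi'\in\Phi'_T$ such that for $x\in\mathcal{X}$ and $f\in\mathcal{F}$ there exists
an $x'\in\mathcal{X}'$ such that for any $t\in[T]$,

\[ \ell(f,x) - \ell_{\phi_t}(f,x) \le \ell'(f,x') - \ell'_{\phi'_t}(f,x') . \]

Since $B$ is linear, we can conclude that for any $\phi\in\Phi_T$ there exists $\phi'\in\Phi'_T$ such that for any $f_1,\ldots,f_T$
and $x_1,\ldots,x_T$, the corresponding $x'_1,\ldots,x'_T$ given by our assumption satisfy

\[
\begin{aligned}
& B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T)) \\
&\qquad \le B(\ell'(f_1,x'_1),\ldots,\ell'(f_T,x'_T)) - B(\ell'_{\phi'_1}(f_1,x'_1),\ldots,\ell'_{\phi'_T}(f_T,x'_T)) .
\end{aligned}
\]

Hence we can conclude that

\[
\begin{aligned}
& \sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\} \\
&\qquad \le \sup_{\phi'\in\Phi'_T}\big\{B(\ell'(f_1,x'_1),\ldots,\ell'(f_T,x'_T)) - B(\ell'_{\phi'_1}(f_1,x'_1),\ldots,\ell'_{\phi'_T}(f_T,x'_T))\big\} .
\end{aligned}
\]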
<P'£$' T 
Now say $q^* = (q_1^*,\ldots,q_T^*)$, where each $q_t^* : (\mathcal{F}\times\mathcal{X}')^{t-1} \mapsto \Delta(\mathcal{F})$, is the minimax optimal strategy for the
player while playing the second game. Also let $p^* = (p_1^*,\ldots,p_T^*)$, where each $p_t^* : (\mathcal{F}\times\mathcal{X})^{t-1} \mapsto \Delta(\mathcal{X})$, be the
minimax optimal strategy for the adversary while playing the first game. In this case we see that

V T (t,*T,?,X)= .E ••• E sup {B(£(f u x 1 ),...J(f T ,x T ))-B(i <t>1 (f 1 ,x 1 ),...J <l>T (f T ,x T ))} 
fi^Qi fT^qT <pe®T 

< E .. E sup {B(£\f 1 ,x[),...J\f T ,x' T ))-B(£^(f 1 ,x[),...J^ T (f T ,x' T ))} 
/i~«J ST~q T(f>l ^' T 

□ 

Proof of Theorem 26. We start by applying the triplex inequality in Theorem 1 along with Theorem 2
to get that:



V T < 2ft T {£,l,B) + sup inf... sup inf sup <( - E B(^(/i, a*), . . . ,^ t (/t, ir)) > + 2*R r (*, $t, S) 

pi 91 p T 9T l*e#T fl:T~qUT 

\ ®UT~Pl:T ) 

= + sup inf . . . sup inf sup J E i V (loss(/ t , a*) - loss(^ o / t , x t )) 1 {< G I t } > + 2*R T (£, $ T , B) 

Pl 91 p T 9T (J, e $ T /l:T~!l;Ti f~ * 

^x 1:T ~p 1:T J 

where the last step above holds because the first term of the triplex inequality is 0 as $B$ is linear (see
Corollary 11). If we take $q_t$ to be the point mass on $f_t = \arg\min_{f}\mathbb{E}_{x_t\sim p_t}[\mathrm{loss}(f,x_t)]$, we see that the second term
of the triplex inequality above is bounded above by 0. Hence we can conclude that

T 



V T < 2*R T (i,®T,B) = 2supE £ 

f,x 



sup \ XI e * ( loss ( f *( e )> x t( e )) ^ loss (V> o f t (e), x t (e))) l{te [r, s]} 



V>G*,[r,s]C[T] T " t 



To bound the above we use Corollary 11 (noting that i^ t {f,x) <E [—2,2]) to get 



< 



8 inf L + *ftf yiog^(^,T) + log(M 



dS 



Now note that \Xp\ < T 2 and so we get that 



V T < 8 inf ^ + 6V2 / 2 J** a \ + 96a r 



Q>0 



T 



T 



We conclude that whenever covering number of VP can be bounded appropriately, adaptive regret can be 
bounded at the expense of an extra O ( \ / m % A ) term. □ 
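Proof of Theorem 21. For any $\theta > 0$, the value of the game $\mathcal{V}_T^{\theta}(\ell,\Phi_T)$, defined in (15), is

\[
\begin{aligned}
\mathcal{V}_T^{\theta}(\ell,\Phi_T)
&= \inf_{q_1}\sup_{p_1}\mathbb{E}_{f_1\sim q_1,\,x_1\sim p_1}\cdots\inf_{q_T}\sup_{p_T}\mathbb{E}_{f_T\sim q_T,\,x_T\sim p_T}\Big[\mathbf{1}\Big\{\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\} > \theta\Big\}\Big] \\
&= \sup_{p_1}\inf_{q_1}\mathbb{E}_{f_1\sim q_1,\,x_1\sim p_1}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{f_T\sim q_T,\,x_T\sim p_T}\Big[\mathbf{1}\Big\{\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\} > \theta\Big\}\Big]
\end{aligned}
\]

via an application of the minimax theorem. Adding and subtracting terms to the expression above leads to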
Vt(£,$ t ) = supinf E ...sup inf E [l{B(£(f 1 ,xi),...,l(fT,x T ))-B(£(q 1 ,pi),...,i(qT,PT)) 

Pl Ql fl~qi p T QT /t~<3T 

+ sup {B(£(q 1 ,p 1 ),...,£(qT,PT))-B(£ tj , 1 (f 1 ,x 1 ) 1 ...,e tj , T (fT,x T ))} > 
<supinf E ...supinf E [1 {B(£(f u Xi), . . . , £(f T , x T )) - B(£( qi , Pl ), . . . , £(q Tl p T )) 

pi 11 /l~<Jl p T IT /t~9t 

+ sup \B(£(qi,pi),...,£(qT,pT))- B(^ 1 (q 1 ,p l ),...J t j >T (q T ,PT))\ 

+ sup {^(^(gi.pi),.. . ,U T {qr,PT)) - B(£^ (fi,xi), . . . , £^ T (f T , x T ))} > 
<supinf E ...supinf E [1 {B(£{f u xi), . . . , £{f T , x T )) - . . . , £{q T ,Pr)) > 0/3} 

P1 11 /l~<Jl p T IT /t~9t 



+1 I sup \B(£( qi , Pl ), . . .,£(q T ,p T )) - B^^gi.pi), . . . , U T (q T , Pt))\ > 0/3 



+1 \ sup {BfoifaiPi),.. . ,U T (qr,PT)) - B^Cfi,^),.. . ,^ t (/ t ,^t))} > 0/3 

At this point, we would like to break up the expression into three terms. To do so, notice that expectation
is linear and sup is a convex function, while for the infimum,

\[ \inf_a \big[C_1(a) + C_2(a) + C_3(a)\big] \le \Big[\sup_a C_1(a)\Big] + \Big[\inf_a C_2(a)\Big] + \Big[\sup_a C_3(a)\Big] \]

for functions $C_1, C_2, C_3$. We use these properties of inf, sup, and expectation, starting from the inside of the
nested expression and splitting the expression in three parts. We arrive at



V T {£, $ T ) 

<supsup E ...supsup E [l{B(£(f 1 ,x 1 ) y ...,£(fT,x T ))-B(£{q 1 ,p 1 ),...,£(q T ,PT))>e/3}] 

Pl Ql ^l^ 1 ?! Pt QT fT^lT 



- sup inf E ... sup inf E 

P1 Ql fl~qi p T QT fT~QT 



+ sup sup E ... sup sup E 

Pl Ql p T q T I't^QT 



1 { sup {B(£( qi , Pl ), . . .,£(q T ,p T )) - B{£^ (qi, P i), . . . , V(9t,Pt))} > 0/3 



1 { sup {Bt^^pi), . . . ,t<p T (qT,PT)) - B(£^(fi,xi), . . .,^(/t,zt))} > 0/3 



As mentioned in the corresponding proof of Theorem 1, the replacement of infima by suprema in the first
and third terms appears to be a loose step and, indeed, one can pick a particular response strategy $\{q_t^*\}$
instead of passing to the supremum.
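The two elementary facts used in the splitting step above can be checked on random inputs. The sketch below is illustrative only: it restricts the infimum to a finite action set, and the function names are hypothetical.

```python
import random


def indicator_split_holds(x, y, z, theta):
    """Check 1{x + y + z > theta} <= 1{x > theta/3} + 1{y > theta/3} + 1{z > theta/3}."""
    lhs = 1 if x + y + z > theta else 0
    rhs = sum(1 for v in (x, y, z) if v > theta / 3)
    return lhs <= rhs


def inf_split_holds(c1, c2, c3):
    """Check inf_a [C1 + C2 + C3] <= sup_a C1 + inf_a C2 + sup_a C3 on a finite action set."""
    lhs = min(a + b + c for a, b, c in zip(c1, c2, c3))
    rhs = max(c1) + min(c2) + max(c3)
    return lhs <= rhs + 1e-12


rng = random.Random(1)
indicator_ok = all(
    indicator_split_holds(rng.uniform(-2, 2), rng.uniform(-2, 2), rng.uniform(-2, 2), rng.uniform(0, 3))
    for _ in range(5000)
)
inf_ok = all(
    inf_split_holds(*[[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)])
    for _ in range(2000)
)
print(indicator_ok, inf_ok)
```

The second inequality holds because the left side is at most the value of the sum at a minimizer of $C_2$, which is where the possible looseness in the first and third terms originates.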
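Consider the second term in the above decomposition. Clearly,

\[
\begin{aligned}
& \sup_{p_1}\inf_{q_1}\mathbb{E}_{f_1\sim q_1,\,x_1\sim p_1}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{f_T\sim q_T,\,x_T\sim p_T}\Big[\mathbf{1}\Big\{\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T))\big\} > \theta/3\Big\}\Big] \\
&\qquad = \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\mathbf{1}\Big\{\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T))\big\} > \theta/3\Big\}
\end{aligned}
\]

because the objective does not depend on the random draws. □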

Proof of Theorem 28. Assume that $B$ is sub-additive (the other case is identical).

B(£^ 1 (qx,Px),-.-,£<t> T (QT,pT)) - B(£ t f >x (f 1 ,xi),...,£ <l , T (f T ,x T )) 
< B(£ tj>1 (q 1 ,p 1 ) - Ifaifuxx), . . -,£<$,■?($?, Vt) ~ £$tUt,xt)) 

By our assumption we have that for any distribution D and any fixed <j> € $t, 

Vu(B(£^(q 1 ,p 1 )-£^(f[,x[),...,U T (q T ,p T )-U T (fT,x' T ))<e/6 \ {fi,x 1 ),...,(f T ,x T )) > \ (27) 
For a given (fx, Xx), . . . , (fx, Xt), let <jf € $ be the transformation defined as 

<f>* = argmax B (£ 4>1 (q 1 ,p 1 ) - £^{fi,xx), . . . ,£<$, t (c£t,Pt) ~ £</> T (fT,x T )) 

(We are assuming for simplicity that the supremum is achieved; otherwise, we can easily modify the arguments
to take care of it.) Since $\phi^*$ is fixed given $(f_1,x_1),\ldots,(f_T,x_T)$, using Equation (27) we get



< P D (B (£ n ( qi , Pl ) - £ n (f[, x[), . . . , % ( qT ,p T ) - £r T (f' T , x' T )) <6/6 I (fi, Xl ),..., (f T , x T )) 



Define set 



A = { ((f 1 ,xi),...,(f T ,x T )) 



sup B (£fc(qx,Px) - £^(fx,xx),.. .,£<j, T (qT,Pr) - £<f, T (fT,x T )) > 6/3 



Since the above inequality holds for any (fx,xx), . . . , (/t,£t), we assert that 



1 



- <P D (B (£ ri (qx, Pl ) - £ ri (f[,x[), . . . : £ rT (q T ,p T ) - £ rT (f T: x' T )) <6/6 \ ((f 1 ,x 1 ), . . . , (f T ,x T )) e A) 



It then follows that 
1 



sup B(£ !j , 1 (qi,p 1 )-l ! i >1 (f 1 ,xi),...,t 4 , T (qT,PT)-e<t >T (fT,XT))> 9/3 



<P sup B(£^ 1 (q 1 ,p 1 )-£ <j>1 (f 1 ,x 1 ),...,£^ T (qT,PT)-U T {fT,x T ))> 9/3 



x P (B (£ .(gi,pi) - e^(f u x'i), ■ ■ ■ ,£r T (lT,pr) - %(/t,*t)) < 9/6 | ((/ 1 ,n),..,(/ T ,i r ))El) 
< P (b(1^ ( qi , Pl ) - ^.(/i.ai), . . . (q T ,p T ) - %(/t, xt)) 

-B{£ n ( qi , Pl ) - £ n {fux'x), ...,e rT (q T , P T)- (/t, x' t )) > 9/6) . 
By subadditivity of $B$, the above expression is upper-bounded by

P (f[, x\) - I4,. (A, xi), . . . , % (/t, it) - % (/t, it)) > 0/6) 



<P( sup B^^/i^i) - ^(/l.aii), . . . ,U T {f T ,x T ) - lj, T (fT,x T )) > 6/6 



Hence, 



supP D ( sup 5(^(91,^1) - ^(/ljXi), ■ ■ - ,^ t (?t,Pt) - l<j, T (fT,x T )) > 6/3 

< 2supP D sup B^^/^xi) - ■ ■ ■ ,UtUt,x' t ) ~ UtUt.xt)) > 6/6 

d V <j>e® T 

= 2 sup E ...sup E 1 <M sup B(£^ 1 (f[,x' 1 ) - ^(fuxi), . . . ,U T (fT,x' T ) - ^ t (/t,it)) > 6/6 

31, PI "l,<"i~Pl q T ,PT X T< X T ~PT [ \0e*T 

Next, introducing a Rademacher random variable e T , the above quantity is equal to 



2 sup E ... sup E E, T 

X 1' X 1^P1 <?T>PT x T, x ' t ~PT 



sup B^iCfi.xl) - e^if^Xi), . . . ,e T (e<p T (fT,XT) ~ UtUt,xt))) > 6/6 



We pass to an upper bound by taking supremum over (/t,xt), (/t,It) : 
2 sup E ... sup E sup E,.,, 

/t-1./t_i~«T-1 



1 <{ sup (/1, Si) -^i(/i,a;i), • • ■ , £t(4t(/t, Zt) - UAh^xr))) > 0/6 



Repeating the process from inside out, we arrive at the upper bound 
2 sup E ei . . . sup E £T 



1 •( sup B(ei(£ 01 x-l) - (fi,xi)), e T {U T Ut, x t ) - Ut (fr, x T ))) > 6/6 



which can be written using the tree notation as 



1 { sup B(ei(4 1 (f{(e),x' 1 (e)) - ^ 1 (fi(e),x 1 (e))), . . . , e T (^(^(e), x T (e)) - ^ T (f T (e),x T (e)))) > 0/6 



2 sup E e 

f,f',x,x' 

= 2 sup P e I sup B(ei(4 1 (f 1 (e),xi(e))-^ 1 (fi(e),x 1 (e))),...,eT(^ T (fT(£),x T (£))- ^ T (f T (e),x T (e)))) > 0/6 
f,f,x,x' y&ei'r 

Next, using subadditivity of B, the last quantity can be upper bounded by 

2 sup P e ( sup {B(ei^ 1 (f 1 (e),x;(e)),...,e T ^. r (fT^ 
f,f,x,x' \^e* T 

<2 sup ip e ( sup B(ei4 1 (f 1 (e),xi(e)),...,eT^ T (fT(e),x T (e))) > 0/12 

f,f',x,x' I \0g* T 



+P £ ( sup B(-ei^ 1 (fi( e ),x 1 (e)),...,-e T V( f T(e),XT(e))) > 0/12 
= 4 sup P e I sup B(ei^ 1 (fi(e),x 1 (e)),...,e T V( f T(e),XT(e))) > 0/12 

f,x \(pG^T 

concluding the proof. □ 
Proof of Lemma 29. By Proposition 39 and the Azuma-Hoeffding inequality for real-valued martingales,

\[ \mathbb{P}\big(B(z_1,\ldots,z_T) > \theta\big) = \mathbb{P}\big(B^q(z_1,\ldots,z_T) > \theta^q\big) \]



< exp 



{6 q -aTrf/pY 
2rfP?T 



□ 



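Proof of Lemma 31. Fix $(\mathbf{f},\mathbf{x})$ and let $V = \{\mathbf{v}^1,\ldots,\mathbf{v}^N\}$ be a minimal $\ell_1$-cover of $\Phi_T$ on $(\mathbf{f},\mathbf{x})$ of size
$N \le \mathcal{N}_1(\theta/2, \Phi_T, T)$. Let $\mathbf{v}[\phi,\epsilon] \in V$ denote a member of the cover which is close to $\phi\in\Phi_T$ on the path $\epsilon$.
By sub-additivity of $G$,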

P £ ^sup G ^I^ e ^ f (f t ( e ), Xt ( e ))^ > 

<P e (sup |G?fi53ct(^(f t (e),Xt( e ))-v[^c]t))J +G^£etv[0,e] t H>^ 

Using the Lipschitz property of $G$ along with $G(0) = 0$ and the triangle inequality, we can upper bound the last
quantity by

\[ \mathbb{P}_\epsilon\Big(\sup_{\phi\in\Phi_T} G\Big(\frac{1}{T}\sum_{t=1}^T \epsilon_t \mathbf{v}[\phi,\epsilon]_t\Big) > \theta/2\Big) , \]

where the last step follows by the definition of the cover. The last quantity can be upper bounded by



max 

vGV 



G U E > 6 i^j ^ E [f E e * v *( £ )) > / 2 ) 

< \V\ su P P e ^G ^ £tZ *( £ )) > / 2 ) 



where the supremum is over all $\mathcal{H}$-valued binary trees $\mathbf{z}$ of depth $T$. □

Proof of Corollary 32. Follows directly by combining Lemma 31 with Corollary 45. □



Proof of Proposition 33. Define $\beta_0 = 1$ and $\beta_j = 2^{-j}$. For a fixed tree $(\mathbf{f},\mathbf{x})$ of depth $T$, let $V_j$ be an
$\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon\in\{\pm1\}^T$ and any $\phi\in\Phi_T$, let $\mathbf{v}[\phi,\epsilon]^j \in V_j$ be a $\beta_j$-close element of the
cover in the $\ell_\infty$ sense. Now, for any $\phi\in\Phi_T$,

T T NT 

l£e t ^ t (f 4 (e),x t ( e )) < I^ et (^ t (f t ( e ), X4 (e))-v[0 !e ]f) ± ]T e ( (v[0, e]| - v[</>, e^ 1 ) 



(=i 



3=1 



t=l 



JV 



< max |^ 4 (f t (e),x t (e)) - v[0,e]f | + ]T 



3=1 



{=i 
Thus,




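We now proceed to upper bound the second term. Consider all possible pairs of $\mathbf{v}^s \in V_j$ and $\mathbf{v}^r \in V_{j-1}$,
for $1 \le s \le |V_j|$, $1 \le r \le |V_{j-1}|$, where we assumed an arbitrary enumeration of elements. For each pair
$(\mathbf{v}^s,\mathbf{v}^r)$, define a real-valued tree $\mathbf{w}^{(s,r)}$ by

\[
\mathbf{w}_t^{(s,r)}(\epsilon) =
\begin{cases}
\mathbf{v}_t^s(\epsilon) - \mathbf{v}_t^r(\epsilon) & \text{if there exists } \phi\in\Phi_T \text{ s.t. } \mathbf{v}^s = \mathbf{v}[\phi,\epsilon]^j,\ \mathbf{v}^r = \mathbf{v}[\phi,\epsilon]^{j-1} \\
0 & \text{otherwise}
\end{cases}
\]

for all $t\in[T]$ and $\epsilon\in\{\pm1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $\mathbf{v}^s$ and
$\mathbf{v}^r$ are indeed the members of the covers (at successive resolutions) close in the $\ell_\infty$ sense to some $\phi\in\Phi_T$.
It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as

\[ W_j = \big\{\mathbf{w}^{(s,r)} : 1 \le s \le |V_j|,\ 1 \le r \le |V_{j-1}|\big\} . \]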

Using the above notations we see that 



sup 



1 T 

^e^ t (f t (e),x t (e)) 



t=i 



AT 



< j3 N + sup < E 



AT 



< /3at + sup 



t=i 

T 



t=i 



(28) 



It is easy to show that $\max_{t\in[T]} |\mathbf{w}_t^j(\epsilon)| \le 3\beta_j$ for any $\mathbf{w}^j \in W_j$ and any path $\epsilon$.

In the remainder of the proof we will use the shorthand $\mathcal{N}_\infty(\beta) = \mathcal{N}_\infty(\beta, \Phi_T, T)$. By the Azuma-Hoeffding
inequality for real-valued martingales,

Hence by union bound we have, 

T 



P £ sup 



^ E *tM (c) > e ft v/logA^^-) ) < 2^x> (ft ) 



2 j T^logAUft) 
exp 



and so 



3j e [N], sup 



jE^wJ( £ ) >^, v /logA4 (ft)) ^f^U/J,) 8 cxp |_ ^ 2l °g^oo(ft) | 



Hence clearly 



E sup 

= 1 wJ£W 3 



jEeXW >0Eft v /logA4 o (ft)] <2EAA co ( / 3 J ) 2 exp {- Tg2l °gfe^ 
J t=i j=i / j=i 1 ^ 
Using the above with Equation (28) gives us that



P £ I sup 



T N \ JV ( 

-^e t ^ t (f i (e),x t (e)) > N + 0^/^/logAU&) < 2 J^ooCft) 2 exp - 

t=l j=l J j=l 

< 2^exp|log7V 00 (/? i ) (2 
.7=1 ^ ^ 



T9* log AT^p, 



T£2 



Since we assume that 2 < ^j-, the right-hand side of the last inequality is bounded above by 



N 

2^ exp 



TO 2 log ^(fry 



I < 2^ exp i -— - logAUft) I < 2e-V ^A^^)" 1 
J j=i !• J ,=1 



By our assumption that $\sum_{j=1}^{N} \mathcal{N}_\infty(\beta_j)^{-1} \le L$ for some appropriate constant $L$, we see that



sup 



if>**(f t (e),x t (e)) 



> p N + 0^/3^1ogA/' oo (/3 7 -) ] < Le 
3=1 



Now picking N appropriately and bounding sum by integral we have that 

AT 



(3 N + 9 J2 Pj v/logAU^) < inf 1 4a + 120 J ^logM^dS^ 



Hence we conclude that 

T 



sup 



I^ et ^ t (f t ( e ),x t (e)) 



> inf \ 4a + 129 \ y/logM^, $T,T)d5 \\ < Le 



The last statement of the Proposition follows from the fact that the Dudley-type integral



inf 1 4a + 120 J ^/logN^S, $ T ,r)tw| 



can be upper bounded by 



1 + 4\/2WTlog 3 (eT 2 ) < 128 1 + 6»A/Tlog 3 (2T) 



times the sequential Rademacher complexity. The proof can be found in 



□ 



Proof of Lemma 35 Let 



/ 1 T \ 1 T 

f^2 e t*t( e ) >csupE -2^e t x f (e) 

V t=i x L *=i 



be the norm dual to || • ||. First note that 

T 



1 + eJTlog^T 



T 

1 1 1 
U/iii.. <i i 'r - : i x 



. id:|H|„<1 



t=l 



sup ^e f (w,x t (e)) 



u;:||u?|L<l , 



t=l 



l + ^A/Tlog^T 
Now, by Proposition 33 for payoff functions $\ell(f,x) = f(x) = \langle f,x\rangle$ and class $\Phi_T$ being the time-invariant
constant departure mapping class, by noting that $\sup_{\mathbf{x}}\mathbb{E}\big\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\mathbf{x}_t(\epsilon)\big\| = \mathfrak{R}_T(\mathcal{F})$ we get that



( I T [ i T 

r^ e * Xt< ^ >csupE yH e * x *( e ) 

V t=i x L *=i 



> c supE 

x 

where $c = 128$. Now note that for a $(\sigma,p)$-smooth space we have that
supE 



l + 6>\/Tlog 3 T < Lcxp(-T8 2 /2) 



1 f; etXi (e) < ^ ( _L S upf] E [||x t ( e )H ) 

i=l J V x t=l / 



1/P 



< 



a 1/p R 



Moreover, the linear class $\mathcal{F}$ has covering numbers satisfying $\mathcal{N}_\infty(\beta) \ge 1/\beta$ and hence $L \le 2$. Thus,



\ t=i 



> c 



l + 6»Y / Tlog 3 T < 2exp(-T6» 2 /2) 



Now setting v = 6a 1/p \/T log 3 T/T 1 " 1 ^ gives the required bound as, 



T 



t=i 



> c 



T 1 - 1 /p 



cvR < 2 cxp - 



j/irpZ-2/p 

2c7 2 /p log 3 T 



The condition > y8/T on (from Proposition 33) implies that the above is valid only for 



v > 



Sct 1 '? log 3/2 T 



□ 



Proof of Theorem 36. Define $\beta_0 = 1$ and $\beta_j = 2^{-j}$. For a fixed tree $(\mathbf{f},\mathbf{x})$ of depth $T$, let $V_j$ be an
$\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon\in\{\pm1\}^T$ and any $\phi\in\Phi_T$, let $\mathbf{v}[\phi,\epsilon]^j \in V_j$ be a $\beta_j$-close element of the
cover in the $\ell_\infty$ sense. Now, for any $\phi\in\Phi_T$,



sup G(l^e t ^ t (f t (e),x t (e))] 

= |G^I^ et ^ t (f t ( e ),x t ( e ))J -G^5>v[«Mf) 

+ 1 [ G [f t 4) - G (f£ ^ ^) ) 

i^e t (^ t (f t (e),x t (e))-v[^e]f) 



< sup 



N 

£ 

3=1 



JV 



< sup < max||^ t (f t (e),x t (e)) - v[0,e]f || +^ 
0e$ T | tei-rj 



< ftv + sup < Y] 



3 = 1 
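Consider all possible pairs of $\mathbf{v}^s \in V_j$ and $\mathbf{v}^r \in V_{j-1}$, for $1 \le s \le |V_j|$, $1 \le r \le |V_{j-1}|$, where we assumed an
arbitrary enumeration of elements. For each pair $(\mathbf{v}^s,\mathbf{v}^r)$, define an $\mathcal{H}$-valued tree $\mathbf{w}^{(s,r)}$ by

\[
\mathbf{w}_t^{(s,r)}(\epsilon) =
\begin{cases}
\mathbf{v}_t^s(\epsilon) - \mathbf{v}_t^r(\epsilon) & \text{if there exists } \phi\in\Phi_T \text{ s.t. } \mathbf{v}^s = \mathbf{v}[\phi,\epsilon]^j,\ \mathbf{v}^r = \mathbf{v}[\phi,\epsilon]^{j-1} \\
0 & \text{otherwise}
\end{cases}
\]

for all $t\in[T]$ and $\epsilon\in\{\pm1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $\mathbf{v}^s$ and
$\mathbf{v}^r$ are indeed the members of the covers (at successive resolutions) close in the $\ell_\infty$ sense to some $\phi\in\Phi_T$.
It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as

\[ W_j = \big\{\mathbf{w}^{(s,r)} : 1 \le s \le |V_j|,\ 1 \le r \le |V_{j-1}|\big\} . \]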



Using the above notations we see that 



sup G[ I^ e ^ t (f f ( e ), Xi (e)) J <0 N + sup \J2 ^E e *( v ^ e ]t- V ^ £ ]r 1 ) 1 
^e*T V *=i / ^ e * T (j=i 1 t=i J 



3=1 



sup 



t=l 



Now, before we proceed, note that any $\mathbf{w}^j \in W_j$ is such that for any $t\in[T]$ and any $\epsilon\in\{\pm1\}^T$, $\|\mathbf{w}_t^j(\epsilon)\| \le 3\beta_j$.
Hence we see that $W_j$ consists of $Y_j$-valued trees, where $Y_j = \{x : \|x\| \le 3\beta_j\}$. Hence

/ t \ N T 

supG ^ e 4(f f ( e ),x f ( e )) <fe + ^ sup =^e t w^(e) 
0e* T \ J t=1 ) pi^ew^ t=1 



iV 



< p N sup 

3=1 y ' 



1 T 



(29) 



where the supremum is over $Y_j$-valued trees.

In the remainder of the proof we will use the shorthand $\mathcal{N}_\infty(\beta) = \mathcal{N}_\infty(\beta, \Phi_T, T)$ and will use the constant
$c = 128$. By Lemma 35, for any $\theta > 8c\sigma^{1/p}\log^{3/2}T / T^{1-1/p}$, we have



By the union bound, 

T 



P e sup 
and so 



> 3 Ti- P Jp + S^yiogAUft)) < 2 A4o(/?i) exp 



T 2-2/ P 2 logTV^) 

2c 2 ct 2 /p log 3 T 



3j G [iV], sup 



Hence, 

/ JV 



E^up 



y E e *y* ( £ ) > ~~7ef + 39 E AViog^oc(ft) < 2 E^(ft) ex p s - 



j=i 



T^^g 2 log A/^ (ft ) 
2c 2 (t 2 /p log 3 T 






Using the above with Equation ( 29 1 gives us that 



sup G h=$>^ t (f t (e),x t (e)) > ^^+/3 Ar + 3^/3 J JlogAA 00 (/3 3 ) 



r r ^ 02logA/;o(ft) 



exp 



2c 2 ct 2 /p log 3 T 



<2 ^expilogA^(ft) (l y 



./ = ! 



2c 2 ct 2 /p log" 3 T 



Our assumption on 8 implies that 4e T . 2 /p lo 8 3 T > 2, so that 



< 2 £ exp r_T^io g ^ (A .) 



4c 2 ct 2 /p log 3 T 



AT 



< 2 exp J T % }> V ATM)' 1 

l \ 4c 2 a 2 /flog 3 Tj ^ 

Since we have assumed that 2 XwLx^oo(/%) _1 < L, we see that 

/ i T \ fi X/p W \ f T ^Vq2 

Using the arguments employed previously, picking N appropriately and bounding sum by integral we have 
that 

AT 



Hence we conclude that 



30 J2 Pi yJfo&MooWi) < mf 1 4a + 360 J y/ log (5) d6^ 



□ 



Proof of Theorem 38. Let $a > 0$ be a constant that we will fix later. Consider a "subgaussian game"
whose value is defined as:

\[
\mathcal{V}_T^{SG}(\ell,\Phi_T) = \inf_{q_1}\sup_{x_1}\mathbb{E}_{f_1\sim q_1}\cdots\inf_{q_T}\sup_{x_T}\mathbb{E}_{f_T\sim q_T}\,\Gamma\Big(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\}\Big) \qquad (30)
\]
where

\[ \Gamma(x) := \sup_{\theta}\exp(aT\theta^2/k)\,\mathbf{1}\{x > \theta\} = \exp(aTx^2/k) . \]

Here, we are using the intuition that we expect to find a player strategy using which the regret will have
subgaussian tails. As before, we consider the calibration setting described in Example 4 augmented with
the restriction that the player's choice belongs to $C_\delta$, a $2\delta$-maximal packing of $\Delta(k)$, instead of $\Delta(k)$. The
choice of $\delta$ will be fixed later. We now apply the general triplex inequality in Appendix B with

\[ \Lambda(x) := \sup_{\theta}\exp(aT\theta^2/k)\,\mathbf{1}\{x > \theta/3\} = \exp(9aTx^2/k) . \]
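That this pair satisfies the condition $\Gamma(x+y+z) \le \Lambda(x) + \Lambda(y) + \Lambda(z)$ required by the general triplex inequality can be checked numerically. The sketch below rests on assumptions not stated in the source: the supremum is taken over $\theta \ge 0$, so $\Gamma$ is set to 0 for non-positive arguments, and the values of $T$ and $k$ are arbitrary.

```python
import math
import random


def gamma_fn(s, a, T, k):
    """Gamma(s) = sup_{theta >= 0} exp(a T theta^2 / k) 1{s > theta}; assumed 0 when s <= 0."""
    return math.exp(a * T * s * s / k) if s > 0 else 0.0


def lambda_fn(s, a, T, k):
    """Lambda(s) = exp(9 a T s^2 / k), the closed form used in the proof."""
    return math.exp(9 * a * T * s * s / k)


a, T, k = 1 / 576, 1000, 3  # a = 1/576 as in the proof; T and k are arbitrary here
rng = random.Random(2)
split_ok = True
for _ in range(5000):
    x, y, z = (rng.uniform(-1, 1) for _ in range(3))
    total = lambda_fn(x, a, T, k) + lambda_fn(y, a, T, k) + lambda_fn(z, a, T, k)
    split_ok = split_ok and gamma_fn(x + y + z, a, T, k) <= total * (1 + 1e-12)

delta = math.sqrt(k / T)
print(split_ok, lambda_fn(delta, a, T, k))
```

With $\delta = \sqrt{k/T}$ and $a = 1/576$ the value $\Lambda(\delta) = \exp(9/576)$ is below 2, which is consistent with the constant 3 appearing in the bound on the value of the subgaussian game.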

Observe that the first term in the General Triplex Inequality is simply equal to 1. The second term is upper
bounded by a particular (sub)optimal response $q_t$ being the point mass on $p_t^\delta$, the element of $C_\delta$ closest to
$p_t$. Note that any $2\delta$-maximal packing is also a $2\delta$ cover. Thus, the second term becomes



sup inf . . . sup inf A sup (<Ji,Pi), . . . , ^ T (qT,Pr))} 



in 



Pt 



'It 



< sup . . . sup A sup sup 

Pi Pt V A>0pGA(fe) 

= sup . . . sup A sup sup 

Pi Pt V A>0pGA(fe) 

^2 



t=l 

l^l{||^-p||<A}-(pf- P i) ) 
t=l / 



\[ \le \Lambda(\delta) = \exp(9aT\delta^2/k) . \]

By the same reasoning as used in the previous proof, the third term



sup Ed 

D 



A 



( s p u P ^E^H/t-pll < \} (ft -Xt)-Et-i[l{\\ ft ~P\\ <\}(ft-xt)}) j 



can be bounded by 



sup Ed 

D 



A max 

\ (p,a)gs 



i 2 (1 {\\ft ~ P\\ < A} (ft - »t) - Et-i [1 {\\ft - P\\ < A} (ft - xt)]) 



where S is a finite set of cardinality l^l < | C5 1 cfe . Since A is non-decreasing and maximum of positive 
quantities is bounded by their sum, we have the upper bound 



sup Ed 



A 



(A,p)es 
< |5| • M A 
where M\ is defined as 



1 T 

-J2(l{\\ft- P\\ < A} (ft - x t ) - E t _! [1 {||/ t - p|| < A} (f t - x t )]) 



A/ A 



sup E 

MDS 



A 



Here the supremum is over all martingale difference sequences $X_1,\ldots,X_T$ with $\|X_t\|_1 \le 2/T$ almost surely.
Since we are considering the case when $\|\cdot\| = \|\cdot\|_1$, we have



M A = sup E 

MDS 



< sup E 

MDS 







T 


exp 


9aT 


E^ 






4=1 






T 


exp 


9aT 








*=i 







Using Corollary 45 we have 







T 




E 


exp ^9aT 




3] 











< e 

< e 

< e 



9aT 



0>e 



2 exp 



6»e 



E^ 

t=l 

iog(gn 

288a / 



> \d6 



2 

rf0 



— d6» < e + 2 < 5 

8>e V 



where we chose $a = 1/576$ to make $288a = 1/2$. This shows that $M_\Lambda \le 5$ and hence the third term is
bounded by $5|S|$.

Now putting the upper bounds on the three triplex inequality terms together, we get that 



V|^,* T )<l + exp^— j+5 



Choose $\delta = \sqrt{k/T}$ to get



V| G (£,$ T ) < 3 + 5 



Using Markov's inequality now shows that there is a player strategy such that against any adversary and
any $\theta > 0$, we have

\[ \mathbb{P}(R_T > \theta) \le 8T^{ck/2}\exp\Big(-\frac{T\theta^2}{576k}\Big) . \]

Equivalently, for the same player strategy, against any adversary and any $\eta\in(0,1)$, we have with probability
at least $1-\eta$,

24 



Rt < 



T 



'fclog ( - 



ck 4 



log(T) 



(31) 



Finally, to show almost sure convergence we need to use a "doubling trick" similar to the one used in [22].
We divide time into episodes $r = 1,2,\ldots$ with episode $r$ of length $2^r$. In episode $r$, the player plays
the optimal strategy for the subgaussian game of length $2^r$. Thus, episode $r$ lasts during the time steps
$E_r = \{2^r - 1, \ldots, 2^{r+1} - 2\}$. Now fix any adversary for the infinite round game and let us focus on the regret
incurred at some time $T$. We have,



Rt = sup sup 

A>0peA(fe) 

riog 2 (T)l 



< 



1 '--°*v- 

- y 

T A>0peA(fe) 
riog 2 (T)l 



< 



l f yi{\\f t - p \\<X}.{f t - Xt ) 
E l{||/*-p||< A}- 



T 



E 2r 



sup sup 
eA 

24 



t£E r 



'fclog 



ck A 



log(2-) 



with probability at least $1 - \sum_{r\le\lceil\log_2(T)\rceil}\eta_{T,r}$. In the last step we used (31) along with a union bound over
episodes. Choosing $\eta_{T,r} = 1/(T^2 2^r)$ ensures that with probability at least $1 - 1/T^2$, we have



R T < 24(1 + a/2) 



fclog(8T 3 ) + ^log(T) 



T 
Since $24(1+\sqrt{2}) < 60$, using Borel-Cantelli, this shows that



— • Rt > 60 infinitely often | = 
/3fclog(2T) + ^log(T) 

This proves the theorem. □ 
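The bookkeeping of the doubling trick used in the proof can be made concrete. The sketch below is illustrative only: the function names are hypothetical, and the union bound is summed over the episodes up to the one containing $T$, which is an assumption about how the index range in the proof is meant.

```python
from fractions import Fraction


def episode(r):
    """Time steps of episode r: {2^r - 1, ..., 2^(r+1) - 2}, which has length 2^r."""
    return range(2 ** r - 1, 2 ** (r + 1) - 1)


def episode_of(t):
    """Index r of the episode containing time step t >= 1."""
    return (t + 1).bit_length() - 1


def failure_budget(T):
    """Sum of eta_{T,r} = 1 / (T^2 2^r) over episodes r = 1, ..., episode_of(T)."""
    return sum(Fraction(1, T * T * 2 ** r) for r in range(1, episode_of(T) + 1))


print(list(episode(1)), list(episode(2)))
print(failure_budget(10) < Fraction(1, 100))
```

Because the per-time failure probability is below $1/T^2$ and $\sum_T 1/T^2$ is finite, the Borel-Cantelli lemma applies, which is how the high-probability bound is converted into an almost sure statement.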

A Concentration of 2-Smooth Functions of Martingale-Difference 
Sums in Banach Spaces 

In this section we prove an extension of some of the results of Pinelis [53]. Let $(\mathcal{H},\|\cdot\|)$ be a separable
Banach space such that there is a function $G : \mathcal{H} \to \mathbb{R}$ with the following properties:

\[
\begin{aligned}
& G(0) = 0 \\
& |G(v+w) - G(v)| \le \|w\| && \text{(Lipschitz)} \\
& (G^2)''(v)[w,w] \le \sigma\|w\|^2 && \text{($G^2$ is $(\sigma,2)$-smooth)}
\end{aligned}
\]

Suppose we have an $\mathcal{H}$-valued MDS $\{X_t\}_{t=1}^T$. Define the partial sums $S_0 = 0$, $S_t = \sum_{s\le t} X_s$ for $t > 0$.
Define, for $t \ge 0$,

\[ Z_t = \cosh(\lambda G(S_t)) . \]

The following lemma is embedded in the proof of Theorem 3.2 in Pinelis. Assume $\sigma \ge 1$ for simplicity. Otherwise,
everything below works by replacing $\sigma$ with $\max\{\sigma, 1\}$.

Lemma 44. Suppose $\|X_t\| \le B$ a.s. and fix $\lambda > 0$. Then $Z_t/c^t$ is a supermartingale where

\[ c = 1 + \sigma(\exp(\lambda B) - 1 - \lambda B) . \]

In particular, we have

\[ \mathbb{E}[Z_T] \le c^T . \]

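Proof. The key step is to define a scalar function $\phi : [0,1] \to \mathbb{R}$:

\[ \phi(\alpha) := \mathbb{E}_{t-1}\big[\cosh(\lambda G(S_{t-1} + \alpha X_t))\big] . \]

Note that $\phi(1) = \mathbb{E}_{t-1}[Z_t]$ and $\phi(0) = Z_{t-1}$, so our goal is to prove $\phi(1) \le c\cdot\phi(0)$. We compute the first
two derivatives of $\phi$,

\[
\begin{aligned}
\phi'(\alpha) &= \mathbb{E}_{t-1}\big[\sinh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot \lambda g'_{S_{t-1},X_t}(\alpha)\big] , \\
\phi''(\alpha) &= \mathbb{E}_{t-1}\big[\cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (\lambda g'_{S_{t-1},X_t}(\alpha))^2\big] && (32) \\
&\quad + \mathbb{E}_{t-1}\big[\sinh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot \lambda g''_{S_{t-1},X_t}(\alpha)\big] , && (33)
\end{aligned}
\]

where, for any $S, X \in \mathcal{H}$, we define $g_{S,X}(\alpha) = G(S + \alpha X)$. Note that

\[
\begin{aligned}
g'_{S,X}(\alpha) &= G'(S + \alpha X)(X) , \\
g''_{S,X}(\alpha) &= G''(S + \alpha X)(X,X) .
\end{aligned}
\]

Now, consider two cases.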
Now, consider two cases. 
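Case 1: $\operatorname{sign}(\lambda g_{S_{t-1},X_t}(\alpha)) = \operatorname{sign}(g''_{S_{t-1},X_t}(\alpha))$. In this case, we use the fact that $\operatorname{sign}(\sinh(x)) = \operatorname{sign}(x\cosh(x))$
and that $|\sinh(x)| \le |x\cosh(x)|$, to obtain the upper bound

\[
\begin{aligned}
& \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (\lambda g'_{S_{t-1},X_t}(\alpha))^2 + \sinh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot \lambda g''_{S_{t-1},X_t}(\alpha) \\
&\qquad \le \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (\lambda g'_{S_{t-1},X_t}(\alpha))^2 + \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot \lambda g_{S_{t-1},X_t}(\alpha) \cdot \lambda g''_{S_{t-1},X_t}(\alpha) \\
&\qquad = \frac{\lambda^2}{2} \cdot \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (g^2_{S_{t-1},X_t})''(\alpha) \\
&\qquad \le \sigma\lambda^2 B^2 \cdot \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) ,
\end{aligned}
\]

because $(g^2_{S_{t-1},X_t})''(\alpha) = (G^2)''(S_{t-1} + \alpha X_t)(X_t, X_t) \le \sigma\|X_t\|^2 \le \sigma B^2$.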

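Case 2: $\operatorname{sign}(\lambda g_{S_{t-1},X_t}(\alpha)) \ne \operatorname{sign}(g''_{S_{t-1},X_t}(\alpha))$. In this case, we simply have,

\[
\begin{aligned}
& \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (\lambda g'_{S_{t-1},X_t}(\alpha))^2 + \sinh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot \lambda g''_{S_{t-1},X_t}(\alpha) \\
&\qquad \le \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (\lambda g'_{S_{t-1},X_t}(\alpha))^2 \\
&\qquad \le \lambda^2 B^2 \cdot \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) ,
\end{aligned}
\]

because, by the Lipschitz property of $G$, we have

\[ |g'_{S_{t-1},X_t}(\alpha)| = |G'(S_{t-1} + \alpha X_t)(X_t)| \le \|G'(S_{t-1} + \alpha X_t)\|_* \cdot \|X_t\| \le 1\cdot B . \]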

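Thus, we always have,

\[ \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot (\lambda g'_{S_{t-1},X_t}(\alpha))^2 + \sinh(\lambda g_{S_{t-1},X_t}(\alpha)) \cdot \lambda g''_{S_{t-1},X_t}(\alpha) \le \sigma\lambda^2 B^2 \cdot \cosh(\lambda g_{S_{t-1},X_t}(\alpha)) . \]

Plugging this into (33), we get

\[
\begin{aligned}
\phi''(\alpha) &\le \sigma\lambda^2 B^2\, \mathbb{E}_{t-1}\big[\cosh(\lambda G(S_{t-1} + \alpha X_t))\big] \\
&\le \sigma\lambda^2 B^2\, \mathbb{E}_{t-1}\big[\cosh(\lambda G(S_{t-1}) + \lambda\alpha\|X_t\|)\big] \\
&\le \sigma\lambda^2 B^2\, \mathbb{E}_{t-1}\big[\cosh(\lambda G(S_{t-1})) \cdot \exp(\lambda\alpha\|X_t\|)\big] \\
&\le \sigma\lambda^2 B^2 \cdot \cosh(\lambda G(S_{t-1})) \cdot \exp(\lambda\alpha B) \\
&= \sigma\lambda^2 B^2 \cdot Z_{t-1} \cdot \exp(\lambda\alpha B) .
\end{aligned}
\]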

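Note that $\phi'(0) = \mathbb{E}_{t-1}[G'(S_{t-1})(X_t)] = G'(S_{t-1})(\mathbb{E}_{t-1}[X_t]) = 0$ by the MDS property. Thus,

\[ \phi'(\beta) = \int_{y=0}^{\beta} \phi''(y)\,dy \]

and therefore

\[
\begin{aligned}
\mathbb{E}_{t-1}[Z_t] = \phi(1) &= \phi(0) + \int_{\beta=0}^{1} \phi'(\beta)\,d\beta \\
&= Z_{t-1} + \int_{\beta=0}^{1}\int_{y=0}^{\beta} \phi''(y)\,dy\,d\beta \\
&= Z_{t-1} + \int_{y=0}^{1}\int_{\beta=y}^{1} \phi''(y)\,d\beta\,dy \\
&= Z_{t-1} + \int_{y=0}^{1} \phi''(y)(1-y)\,dy \\
&\le Z_{t-1}\cdot\Big(1 + \sigma\lambda^2 B^2\int_{0}^{1}\exp(\lambda B y)(1-y)\,dy\Big) \\
&= Z_{t-1}\cdot\big(1 + \sigma(\exp(\lambda B) - 1 - \lambda B)\big) .
\end{aligned}
\]

□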
Now that we have control over $\mathbb{E}[\cosh(\lambda G(S_T))]$, the following control on the m.g.f. is immediate.

Corollary 45. Under the same conditions as the previous lemma,

\[ \mathbb{E}[\exp(\lambda G(S_T))] \le 2c^T . \]

Moreover,

\[ \mathbb{P}(G(S_T) > \epsilon) \le 2\exp\Big(-\frac{\epsilon^2}{4T\sigma B^2}\Big) \]

whenever $T \ge \epsilon/(2\sigma B)$.



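Proof. The first inequality follows by noting that $\cosh(x) = (\exp(x) + \exp(-x))/2 \ge \exp(x)/2$.
For the second inequality,

\[
\begin{aligned}
\mathbb{P}(G(S_T) > \epsilon) &= \mathbb{P}(\exp(\lambda G(S_T)) > \exp(\lambda\epsilon)) \\
&\le \exp(-\lambda\epsilon)\,\mathbb{E}[\exp(\lambda G(S_T))] \\
&\le 2\exp(-\lambda\epsilon)\,(1 + \sigma(\exp(\lambda B) - 1 - \lambda B))^T \\
&= 2\exp\{-\lambda\epsilon + T\log(1 + \sigma(\exp(\lambda B) - 1 - \lambda B))\} \\
&\le 2\exp\{-\lambda\epsilon + T\sigma(\exp(\lambda B) - 1 - \lambda B)\} \\
&\le 2\exp\{-\lambda\epsilon + T\sigma\lambda^2 B^2\}
\end{aligned}
\]

where the last inequality is valid for any $\lambda \le 1/B$. Optimizing over $\lambda$, we let

\[ \lambda = \frac{\epsilon}{2T\sigma B^2} , \]

which yields the desired upper bound. The condition $\lambda \le 1/B$ is satisfied whenever $T \ge \epsilon/(2\sigma B)$.

□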
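The chain of exponents in the proof of Corollary 45 and the effect of the optimizing choice of $\lambda$ can be verified numerically. This is a small illustrative sketch; the function names and the parameter values are hypothetical.

```python
import math


def exponent_chain(lam, eps, T, sigma, B):
    """Return the three successive exponents bounding log(P(G(S_T) > eps) / 2)."""
    u = lam * B
    psi = math.exp(u) - 1 - u  # exp(lambda B) - 1 - lambda B
    return (
        -lam * eps + T * math.log1p(sigma * psi),
        -lam * eps + T * sigma * psi,
        -lam * eps + T * sigma * u * u,
    )


def optimal_lambda(eps, T, sigma, B):
    """The minimizer of -lambda * eps + T * sigma * lambda^2 * B^2 over lambda."""
    return eps / (2 * T * sigma * B * B)


eps, T, sigma, B = 1.0, 100, 1.0, 1.0
lam_star = optimal_lambda(eps, T, sigma, B)
first, second, third = exponent_chain(lam_star, eps, T, sigma, B)
print(lam_star <= 1 / B, first <= second <= third)
print(math.isclose(third, -eps ** 2 / (4 * T * sigma * B ** 2)))
```

The last step of the chain relies on $e^u - 1 - u \le u^2$ for $0 \le u \le 1$, which is why the restriction $\lambda \le 1/B$ appears.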



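With control on the m.g.f., a Massart-style union bound argument at the level of expectations is immediate.

Theorem 46. Suppose $\{X_t^\gamma\}_{t=1}^T$ is a family of MDS indexed by $\gamma$ in some finite set $\Gamma$. Suppose for each
$\gamma, t$, $\|X_t^\gamma\| \le B$ a.s. Then, we have, for any $T \ge \log(2|\Gamma|)/\sigma$,

\[ \mathbb{E}\Big[\max_{\gamma\in\Gamma} G(S_T^\gamma)\Big] \le 2B\sqrt{\sigma\log(2|\Gamma|)\,T} \]

where $S_T^\gamma = \sum_{t=1}^T X_t^\gamma$.

Proof. Fix $\lambda > 0$. Then,

\[
\begin{aligned}
\exp\Big(\lambda\,\mathbb{E}\Big[\max_{\gamma\in\Gamma} G(S_T^\gamma)\Big]\Big) &\le \mathbb{E}\Big[\exp\Big(\lambda\max_{\gamma\in\Gamma} G(S_T^\gamma)\Big)\Big] \\
&= \mathbb{E}\Big[\max_{\gamma\in\Gamma}\exp(\lambda G(S_T^\gamma))\Big] \\
&\le \mathbb{E}\Big[\sum_{\gamma\in\Gamma}\exp(\lambda G(S_T^\gamma))\Big] \\
&\le 2|\Gamma|\cdot(1 + \sigma(\exp(\lambda B) - 1 - \lambda B))^T .
\end{aligned}
\]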
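Taking logs and dividing by $\lambda$ gives,

\[
\begin{aligned}
\mathbb{E}\Big[\max_{\gamma\in\Gamma} G(S_T^\gamma)\Big] &\le \frac{\log(2|\Gamma|) + T\log(1 + \sigma(\exp(\lambda B) - 1 - \lambda B))}{\lambda} \\
&\le \frac{\log(2|\Gamma|) + T\sigma(\exp(\lambda B) - 1 - \lambda B)}{\lambda} \\
&\le \frac{\log(2|\Gamma|) + T\sigma\lambda^2 B^2}{\lambda} ,
\end{aligned}
\]

where the last inequality is valid for any $\lambda \le 1/B$. Optimizing over $\lambda$, we choose $\lambda = \sqrt{\log(2|\Gamma|)/(T\sigma B^2)}$,
which is at most $1/B$ under the condition $T \ge \log(2|\Gamma|)/\sigma$. Plugging this in gives,

\[ \mathbb{E}\Big[\max_{\gamma\in\Gamma} G(S_T^\gamma)\Big] \le 2B\sqrt{\sigma\log(2|\Gamma|)\,T} . \]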

Lemma 47. If F is a non-negative real-valued random variable and¥(F > e) < 2 exp | — 1 , then 

EF < v/2ttc/T. 

More generally, ifP(F > a + e) < 27V exp {-^} f or e > \f~ 



'41og(2JV) 



then 



EF <a 



(Vlog(2JV) 



□ 



Proof. 



p oo poo ( 

EF = J P(F > e)de < 2 J cxp |- 



Te 2 1 /27TC 1 f°° r a , , 



2ttc 



For the second statement, 



/•OO />00 

EF = ¥(F > a + e)de <a + x+ / P(F > a + e)de. 

JO Jx 



Choose x = J il °s( 2N \ For e > x, it holds that + log(27V) < Thus, 



EF < 



41og(2iV) 



exp 



' be 2 ] _ _ /41og(2iV) ( /4tt 1 



6 V2ir 



pOO 

I exp{-it 2 /2}cfo. 
Jo 



□ 



B A General Triplex Inequality 

Here we make the observation that the two versions of the triplex inequality, namely the expected (Theorem 1)
and high probability (Theorem 27) versions, are special cases of a general triplex inequality which bounds
the value of a "$\Gamma$-game" defined as:

\[
\mathcal{V}_T^{\Gamma}(\ell,\Phi_T) = \inf_{q_1}\sup_{p_1}\mathbb{E}_{f_1\sim q_1,\,x_1\sim p_1}\cdots\inf_{q_T}\sup_{p_T}\mathbb{E}_{f_T\sim q_T,\,x_T\sim p_T}\,\Gamma\Big(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\}\Big) \qquad (34)
\]

The expectation and high-probability games are recovered by choosing $\Gamma(x) = x$ and $\Gamma(x) = \mathbf{1}\{x > \theta\}$
respectively. We now state and prove the general triplex inequality.

Theorem 48 (General Triplex Inequality). If $\Gamma$ satisfies

\[ \Gamma(x + y + z) \le \Lambda(x) + \Lambda(y) + \Lambda(z) \]

for some $\Lambda : \mathbb{R} \to \mathbb{R}$, then we have,

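\[
\begin{aligned}
\mathcal{V}_T^{\Gamma}(\ell,\Phi_T) \le{}& \sup_{D}\mathbb{E}_{D}\Big[\Lambda\big(B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))\big)\Big] \\
&+ \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\Lambda\Big(\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T))\big\}\Big) \\
&+ \sup_{D}\mathbb{E}_{D}\Big[\Lambda\Big(\sup_{\phi\in\Phi_T}\big\{B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\}\Big)\Big]
\end{aligned}
\]

where $D$ ranges over distributions over sequences $(x_1,f_1),\ldots,(x_T,f_T)$.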
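Proof. The value of the game $\mathcal{V}_T^{\Gamma}(\ell,\Phi_T)$, defined in (34), is

\[
\begin{aligned}
\mathcal{V}_T^{\Gamma}(\ell,\Phi_T)
&= \inf_{q_1}\sup_{p_1}\mathbb{E}_{f_1\sim q_1,\,x_1\sim p_1}\cdots\inf_{q_T}\sup_{p_T}\mathbb{E}_{f_T\sim q_T,\,x_T\sim p_T}\,\Gamma\Big(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\}\Big) \\
&= \sup_{p_1}\inf_{q_1}\mathbb{E}_{f_1\sim q_1,\,x_1\sim p_1}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{f_T\sim q_T,\,x_T\sim p_T}\,\Gamma\Big(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T))\big\}\Big)
\end{aligned}
\]

via an application of the minimax theorem. Adding and subtracting terms to the expression above leads to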
Vt(£,&t) =supinf E ...supinf E [T (B{£(fx, xx), . . . ,t(f T , x T )) - B(£{q ljPl ), . . . ,£{q T ,Pr)) 

pi 91 /l~<2l p T IT fT~QT 

+ sup {B(£(qx,pi),...,£(q T ,PT)) - B(£ 4 , 1 (fx,xx),..-,£4, T (fT,x T ))} 

<supinf E ...supinf E [T (B(£(fx, xx), . . . ,t{h, *t)) - B(£{q ljPl ), . . . , £{q T ,Pr)) 
Pl qi h~qi PT qT fT~qT 

^1~P1 x T ~p T 

+ sup \B(£(q 1 , Pl ),...,£(q T , P T))- B(£ < j, 1 (q 1 ,p 1 ),...,£ t/ , T (qT, P T))} 

+ sup {B^iqx^i), . . .,U T (qr,pr)) - B(£^ (/i, xi), . . . , (f T , x T ))} 
<supinf E ...supinf E [A (B(£(fx, xx), ■ ■ ■ , £{fr, x T )) - B(£(q 1)Pl ), . . . ,£{q T ,Pr))) 

p 1 11 /i~9i PT It 1t~1t 



+A I sup 



[B(£(qx,px), . . .,£(q T ,p T )) - B^qx^i), . . . ,U T (qT,p T ))}j 



+A ( sup {B(£ (j>1 (q 1 ,p 1 ), . . . ,e^ T (q T , P T)) ~ B{£ <t , 1 (f 1 ,x 1 ), . . . , £$ T (fr, x T ))} 
<#>e« T 



4 To be precise, the expectation version of the Triplex inequality presented in Theorem 1 is slightly different, as the expectation
is taken outside of $B$. Modulo this difference, the proofs are identical.
At this point, we would like to break up the expression into three terms. To do so, notice that expectation
is linear and sup is a convex function, while for the infimum, 



\[
\inf_{a} \big[ C_1(a) + C_2(a) + C_3(a) \big] \le \Big[ \sup_{a} C_1(a) \Big] + \Big[ \inf_{a} C_2(a) \Big] + \Big[ \sup_{a} C_3(a) \Big]
\]

for functions $C_1, C_2, C_3$. We use these properties of inf, sup, and expectation, starting from the inside of the
nested expression and splitting the expression into three parts. We arrive at

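\[
\begin{aligned}
\mathcal{V}^{\Gamma}_T(\ell, \Phi_T)
&\le \sup_{p_1} \sup_{q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T} \sup_{q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \Big[ A\Big( B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) \Big) \Big] \\
&\quad + \sup_{p_1} \inf_{q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \Big[ A\Big( \sup_{\phi \in \Phi_T} \big\{ B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T)) \big\} \Big) \Big] \\
&\quad + \sup_{p_1} \sup_{q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T} \sup_{q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \Big[ A\Big( \sup_{\phi \in \Phi_T} \big\{ B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T)) - B(\ell_{\phi_1}(f_1,x_1),\ldots,\ell_{\phi_T}(f_T,x_T)) \big\} \Big) \Big].
\end{aligned}
\]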



As mentioned in the corresponding proof of Theorem 1, the replacement of infima by suprema in the first
and third terms appears to be a loose step; indeed, one can pick a particular response strategy $\{q_t^*\}$
instead of passing to the supremum.
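The inequality for the infimum used in the splitting step follows by bounding $C_1$ and $C_3$ by their suprema before taking the infimum over $a$. The following is a minimal numerical sanity check of that inequality on finite domains, sketched in Python. The helper name `inf_split_sides` and the randomly generated functions are hypothetical and serve only as an illustration; they are not part of the original argument.

```python
import random


def inf_split_sides(c1, c2, c3):
    """Return both sides of inf_a [C1 + C2 + C3](a) <= sup C1 + inf C2 + sup C3.

    Each argument lists the values of one function on a common finite
    domain (index a = 0, ..., n - 1).
    """
    lhs = min(u + v + w for u, v, w in zip(c1, c2, c3))
    rhs = max(c1) + min(c2) + max(c3)
    return lhs, rhs


# Check the inequality on randomly generated functions (illustrative data only).
rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(1, 8)
    c1, c2, c3 = ([rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(3))
    lhs, rhs = inf_split_sides(c1, c2, c3)
    assert lhs <= rhs + 1e-12
```

The check only exercises finite domains; the inequality itself holds for arbitrary domains by the same one-line argument.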

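Consider the second term in the above decomposition. Clearly,

\[
\begin{aligned}
&\sup_{p_1} \inf_{q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T} \inf_{q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \Big[ A\Big( \sup_{\phi \in \Phi_T} \big\{ B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T)) \big\} \Big) \Big] \\
&\qquad = \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} A\Big( \sup_{\phi \in \Phi_T} \big\{ B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell_{\phi_1}(q_1,p_1),\ldots,\ell_{\phi_T}(q_T,p_T)) \big\} \Big)
\end{aligned}
\]

because the objective does not depend on the random draws.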



□ 
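As an illustration of the hypothesis of the general triplex inequality, the following worked check verifies the condition on $\Gamma$ for the expectation and high-probability choices. The functions $A$ below, and the symbol $\delta$ for the threshold in the indicator, are choices made here for illustration under the stated assumptions; they are not taken from the text.

```latex
% Case 1: \Gamma(x) = x. Take A(x) = x; the condition holds with equality.
\[
\Gamma(x + y + z) = x + y + z = A(x) + A(y) + A(z).
\]

% Case 2: \Gamma(x) = \mathbf{1}\{x > \delta\}.
% Assumed choice: A(x) = \mathbf{1}\{x > \delta/3\}.
% If x + y + z > \delta, then at least one of x, y, z exceeds \delta/3
% (otherwise the sum would be at most \delta), so
\[
\mathbf{1}\{x + y + z > \delta\}
  \le \mathbf{1}\{x > \delta/3\} + \mathbf{1}\{y > \delta/3\} + \mathbf{1}\{z > \delta/3\}.
\]
```

With the second choice, each of the three terms in the bound becomes a probability, or a worst-case indicator, that the corresponding quantity exceeds $\delta/3$.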